Specialty Store Marketing Campaign Analysis¶

By: Lincoln Berbert

The Context:¶

In today's competitive business environment, understanding customer behavior and preferences is crucial for success. Companies strive to identify meaningful customer segments in order to tailor their products, services, and marketing strategies effectively. This case study focuses on a dataset containing various customer-related features, aiming to discover the best possible customer segments using unsupervised learning techniques, such as dimensionality reduction and clustering. Surveys have shown that up to 80% of participants do business with companies that provide them with a more personalized experience, and that targeted ad campaigns bring in a 77% return on investment.

The dataset under analysis--which is from a higher end specialty goods market business with retail locations, a shipping catalog, and an online web store-- is comprised of diverse features of customers, including demographic, socio-economic, and behavioral data. This rich dataset allows us to examine how customers interact with the company and their purchasing patterns, providing valuable insights for targeted marketing campaigns and personalized customer experiences.

The Objective:¶

The main objective of this study is to leverage unsupervised learning methods, specifically dimensionality reduction and clustering techniques, to identify meaningful customer segments based on the given dataset. By uncovering these segments, the company can better understand its customers, enabling more effective targeting and resource allocation in future marketing efforts. We will then use these segments to provide campaign recommendations and actionable processes for future marketing campaigns.

The Key Questions:¶

  1. What underlying patterns or trends can be identified in the dataset using dimensionality reduction techniques?
  2. What dimensions/features aren't relevant to creating our customer segments, such as features with high collinearity or features with low correlation values.
  3. Can meaningful customer segments be discovered through clustering algorithms?
  4. How do the identified customer segments differ from each other in terms of demographics, socio-economic status, and purchasing behavior?
  5. What is the best clustering algorithm for our dataset and why?
  6. How can the identified customer segments be utilized for more effective marketing strategies and personalized customer experiences?

The Problem Formulation:¶

Given the customer dataset with features such as year of birth, education, marital status, household composition, income, purchasing behavior, and campaign responses, we aim to:

  1. Preprocess the data, handling missing values and outliers, and preparing the dataset for further analysis.
  2. Apply dimensionality reduction techniques such as t-SNE and PCA to visualize the underlying structure of the dataset and identify potential patterns.
  3. Implement clustering algorithms to group customers into distinct segments based on their similarities across the various features. These algorithms will include K-Means, K-Medoids, Hierarchical clustering, DBSCAN, and Gaussian Mixture Models (GMMs).
  4. Analyze and interpret the resulting customer segments, highlighting the key differences and insights that can be used to inform marketing strategies and improve customer engagement.

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruits products in the last 2 years
  13. MntSweetProducts: Amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 If the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Import the necessary libraries and load the data¶

In [1]:
# Libraries for data manipulation and calculations
import pandas as pd
import numpy as np

# Libraries for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style = 'darkgrid')

# Encoding variables
from sklearn.preprocessing import LabelEncoder

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Data scaling
from sklearn.preprocessing import StandardScaler

# Calculating distances
from scipy.spatial.distance import cdist, pdist

# K-Means & silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-Medoids, Hierarchical clustering, DBSCAN, and Gaussian Mixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering 
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Aditional computations and graphs
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# Supress warnings
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Importing the dataset and creating a copy to not affect the original
marketdata = pd.read_csv("marketing_campaign.csv")
df = marketdata.copy()

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
  • Drop the column which has no null values
In [3]:
# Checking the shape of the data
df.shape
Out[3]:
(2240, 27)

The dataset has 2240 rows and 27 columns

In [4]:
# Checking a few random samples from the data
df.sample(10, random_state = 1)
Out[4]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
779 10736 1971 Graduation Single 72258.0 0 1 12-09-2013 28 522 ... 9 5 2 0 0 0 0 0 0 0
389 9799 1968 PhD Divorced 83664.0 1 1 08-05-2013 57 866 ... 2 12 5 0 0 0 0 0 0 0
510 9925 1981 PhD Together 39665.0 1 0 25-05-2013 97 127 ... 2 3 7 1 0 0 0 0 0 0
1553 7321 1962 Graduation Together 76081.0 0 0 23-05-2014 85 292 ... 5 4 2 0 0 0 1 0 0 0
1172 8278 1990 PhD Married 74214.0 0 0 26-08-2012 3 863 ... 2 5 5 0 0 0 0 0 0 0
1204 2995 1957 Master Together 66636.0 0 0 17-08-2013 64 291 ... 4 9 1 0 0 0 0 0 0 0
2057 1071 1976 PhD Divorced 70179.0 0 1 21-07-2013 10 532 ... 3 13 5 0 0 0 0 0 0 0
1960 8925 1965 Master Married 70053.0 0 1 03-07-2013 38 512 ... 5 10 5 0 0 0 0 0 0 0
615 3083 1974 Graduation Married 45837.0 1 1 26-07-2013 88 215 ... 2 5 7 0 0 0 0 0 0 0
761 6887 1967 Graduation Single 79146.0 1 1 24-04-2014 33 245 ... 1 8 6 0 0 0 0 0 0 0

10 rows × 27 columns

At first glance:

  • There seems to be a lot of people with incomes in the 70,000 dollar - 80,000 dollar range.
  • Not a lot of complaints or campaign responses/acceptance.
  • Store purchases seem to be higher than than catalog purchases.
In [5]:
# checking thte data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
  • Income seems to be the only column with missing values.
  • All the data types are integers except for Dt_Customer, Income, Marital_Status, and Education. This makes sense as dates, marital status, and education level are not numerical values. We can however convert these all (besides date) to numerical values if needed to perform analysis by assigning numbers to each of the responses, but this could cause problems if not done carefully.
  • Income can easily be rounded to an integer if needed.
In [6]:
# Checking for duplicates
df.duplicated().sum()
Out[6]:
0

No duplicates are present.

In [7]:
# Imputing the missing values using the Median as it is less sensitive to outliers than Mean.
median_income = df['Income'].median()
df['Income'].fillna(median_income, inplace = True)

# Checking that the values have been properly imputed
df['Income'].isnull().sum()
Out[7]:
0

There are now no missing values in the income column.

In [8]:
df['ID'].nunique()
Out[8]:
2240
In [9]:
# Removing the ID column from the data
df.drop('ID', axis = 1, inplace = True)

The ID column is a unique identifer for each individual and as such provides no valuable information and can be dropped.

Observations and Insights from the Data overview:¶

  • The data is now prepared and ready for our initial exploratory analysis.
  • We have 26 Columns of data after removing the unecessary ID column.
  • Only Education, Marital_Status, and Dt_Customer are non-numerical columns.

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, will be done.

Questions:

  1. What is the summary statistics of the data? Explore summary statistics for numerical variables and the categorical variables
  2. Find out number of unique observations in each category of categorical columns? Write findings/observations/insights
  3. Are all categories different from each other or can we combine some categories?
In [10]:
# Checking the summary statistics of the data
df.describe(include = 'all').T
Out[10]:
count unique top freq mean std min 25% 50% 75% max
Year_Birth 2240.0 NaN NaN NaN 1968.805804 11.984069 1893.0 1959.0 1970.0 1977.0 1996.0
Education 2240 5 Graduation 1127 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 2240 8 Married 864 NaN NaN NaN NaN NaN NaN NaN
Income 2240.0 NaN NaN NaN 52237.975446 25037.955891 1730.0 35538.75 51381.5 68289.75 666666.0
Kidhome 2240.0 NaN NaN NaN 0.444196 0.538398 0.0 0.0 0.0 1.0 2.0
Teenhome 2240.0 NaN NaN NaN 0.50625 0.544538 0.0 0.0 0.0 1.0 2.0
Dt_Customer 2240 663 31-08-2012 12 NaN NaN NaN NaN NaN NaN NaN
Recency 2240.0 NaN NaN NaN 49.109375 28.962453 0.0 24.0 49.0 74.0 99.0
MntWines 2240.0 NaN NaN NaN 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 NaN NaN NaN 26.302232 39.773434 0.0 1.0 8.0 33.0 199.0
MntMeatProducts 2240.0 NaN NaN NaN 166.95 225.715373 0.0 16.0 67.0 232.0 1725.0
MntFishProducts 2240.0 NaN NaN NaN 37.525446 54.628979 0.0 3.0 12.0 50.0 259.0
MntSweetProducts 2240.0 NaN NaN NaN 27.062946 41.280498 0.0 1.0 8.0 33.0 263.0
MntGoldProds 2240.0 NaN NaN NaN 44.021875 52.167439 0.0 9.0 24.0 56.0 362.0
NumDealsPurchases 2240.0 NaN NaN NaN 2.325 1.932238 0.0 1.0 2.0 3.0 15.0
NumWebPurchases 2240.0 NaN NaN NaN 4.084821 2.778714 0.0 2.0 4.0 6.0 27.0
NumCatalogPurchases 2240.0 NaN NaN NaN 2.662054 2.923101 0.0 0.0 2.0 4.0 28.0
NumStorePurchases 2240.0 NaN NaN NaN 5.790179 3.250958 0.0 3.0 5.0 8.0 13.0
NumWebVisitsMonth 2240.0 NaN NaN NaN 5.316518 2.426645 0.0 3.0 6.0 7.0 20.0
AcceptedCmp3 2240.0 NaN NaN NaN 0.072768 0.259813 0.0 0.0 0.0 0.0 1.0
AcceptedCmp4 2240.0 NaN NaN NaN 0.074554 0.262728 0.0 0.0 0.0 0.0 1.0
AcceptedCmp5 2240.0 NaN NaN NaN 0.072768 0.259813 0.0 0.0 0.0 0.0 1.0
AcceptedCmp1 2240.0 NaN NaN NaN 0.064286 0.245316 0.0 0.0 0.0 0.0 1.0
AcceptedCmp2 2240.0 NaN NaN NaN 0.012946 0.113069 0.0 0.0 0.0 0.0 1.0
Complain 2240.0 NaN NaN NaN 0.009375 0.096391 0.0 0.0 0.0 0.0 1.0
Response 2240.0 NaN NaN NaN 0.149107 0.356274 0.0 0.0 0.0 0.0 1.0
  • There are 5 different Education categoreis and 8 Marital statuses. We should look into these to see if they can be reduced and what the unique values are being listed as.
  • The most common Marital status is "Married"
  • The data includes customers born from 1893 - 1996. This might need to be looked at as 1893 is very far from the rest of the data and could have been an improper input.
  • Store purchases are on average higher than any other purchasing stream.
  • Very few people complain or respond to the marketing campaigns.
  • Median income is ~51,382 Dollars.
  • The maximum number of small kids in the household recorded is 2.
  • 50% of users visit more the web 6 or more times in a month.
In [11]:
# Gathering all the categorical features, including the encoded values of campaign responses and complaints
categorical_features = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'AcceptedCmp1', 'AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain']
In [12]:
# Counting and listing the unique values in each categorical feature.
for feature in categorical_features:
    print("Unique values in", feature + ":\n")
    print(df[feature].value_counts())
    print('\n' + '-' * 100 + '\n')
Unique values in Education:

Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in Marital_Status:

Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in Kidhome:

0    1293
1     899
2      48
Name: Kidhome, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in Teenhome:

0    1158
1    1030
2      52
Name: Teenhome, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in AcceptedCmp1:

0    2096
1     144
Name: AcceptedCmp1, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in AcceptedCmp2:

0    2211
1      29
Name: AcceptedCmp2, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in AcceptedCmp3:

0    2077
1     163
Name: AcceptedCmp3, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in AcceptedCmp4:

0    2073
1     167
Name: AcceptedCmp4, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in AcceptedCmp5:

0    2077
1     163
Name: AcceptedCmp5, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in Response:

0    1906
1     334
Name: Response, dtype: int64

----------------------------------------------------------------------------------------------------

Unique values in Complain:

0    2219
1      21
Name: Complain, dtype: int64

----------------------------------------------------------------------------------------------------

  • We can remove 2n Cycle in Education because that is the same as a Master's
  • We can also merge Alone, Absurd, and YOLO, in Marital_Status as these are people who are single who put in comedic responses rather than their actual status.
In [13]:
# Replacing the unusual values in Education and Marital Status
df["Education"].replace('2n Cycle', 'Master', inplace = True)
df["Marital_Status"].replace({'Alone': 'Single', 'Absurd': 'Single', 'YOLO': 'Single'}, inplace = True)

# Checking to make sure the values got replaced properly
print(df['Education'].value_counts())
print()
print(df['Marital_Status'].value_counts())
Graduation    1127
Master         573
PhD            486
Basic           54
Name: Education, dtype: int64

Married     864
Together    580
Single      487
Divorced    232
Widow        77
Name: Marital_Status, dtype: int64

Our data now has no missing values as well as proper categorical variables. Next we will do Univariate analysis to find outliers and other trends within the data.

Univariate Analysis on Numerical and Categorical data¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

  • Plot histogram and box plot for different numerical features and understand how the data looks like.
  • Explore the categorical variables like Education, Kidhome, Teenhome, Complain.
  • A few questions have been mentioned below which will help approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, will be done.

Leading Questions:

  1. How does the distribution of Income variable vary across the dataset?
  2. The histogram and the box plot are showing some extreme value on the right side of the distribution of the 'Income' feature. Can we consider them as outliers and remove or should we analyze these extreme values?
  3. There are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? At what percentile the upper whisker lies?
In [14]:
# Gathering all the numerical features
numerical_features = ['Year_Birth', 'Income', 'Recency', 'MntFishProducts', 'MntMeatProducts', 'MntFruits', 'MntSweetProducts', 'MntWines', 'MntGoldProds', 'NumDealsPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebPurchases', 'NumWebVisitsMonth']
In [15]:
# Creating a function to graph histograms and boxplots for each numerical variable
def plot_hist_boxplot(feature, data, hist_color='winter', box_color='violet'):
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5), sharex=True, gridspec_kw = {'height_ratios': (0.25, 0.75)})
    
    # Boxplot
    sns.boxplot(data=data, x = feature, color=box_color, ax = ax1)
    ax1.set_title(f'Plots of {feature}')
    ax1.set_xlabel('')
    
    # Histplot
    sns.histplot(data=data, x=feature, kde=True, palette=hist_color, ax=ax2)
    
    plt.tight_layout()
    plt.show()
In [16]:
# Printing the graphs and the skew
for feature in numerical_features:
    plot_hist_boxplot(feature, df)
    print('Skew :', round(df[feature].skew(),2))
    print('\n' + '-' * 120 + '\n' * 3)
Skew : -0.35

------------------------------------------------------------------------------------------------------------------------



Skew : 6.8

------------------------------------------------------------------------------------------------------------------------



Skew : -0.0

------------------------------------------------------------------------------------------------------------------------



Skew : 1.92

------------------------------------------------------------------------------------------------------------------------



Skew : 2.08

------------------------------------------------------------------------------------------------------------------------



Skew : 2.1

------------------------------------------------------------------------------------------------------------------------



Skew : 2.14

------------------------------------------------------------------------------------------------------------------------



Skew : 1.18

------------------------------------------------------------------------------------------------------------------------



Skew : 1.89

------------------------------------------------------------------------------------------------------------------------



Skew : 2.42

------------------------------------------------------------------------------------------------------------------------



Skew : 1.88

------------------------------------------------------------------------------------------------------------------------



Skew : 0.7

------------------------------------------------------------------------------------------------------------------------



Skew : 1.38

------------------------------------------------------------------------------------------------------------------------



Skew : 0.21

------------------------------------------------------------------------------------------------------------------------



  • Recency seems to be Uniformly distributed.
  • Birth year has a few extreme outliers in the 1800s, these likely are data points that were meant to be listed as 1990s rather than 1890s and can be fixed.
  • Income follows a normal distribution, however it has some extreme outliers. Depending on how many values exist on these far ends, they could create a unique cluster so we might want to keep them. More analysis will be done on this later.
  • All of the product categoreis are heavily right skewed. This makes sense as each product will be purchased on the lower end, but some individuals will buy a lot of certain products.
  • All of the purchasing streams also have a right skew, with store purchase having the least skew and including no outliers.
  • Oddly enough, even though web purchases have a right skew, web visits seems to visually have somewhat of a left skew, but with high end outliers. This could have some interesting dynamics within our clusters later.
  • Some of the extreme outliers on the different product types could be wholesale purchasers attempting to resell the items, it will be interesting to see if these individuals fall into a cluster together.
In [17]:
# Function to update outlier Year_Birth values
def update_year_birth(year):
    if year >= 1800 and year < 1901:
        return year + 100
    else:
        return year

# Apply the function to the Year_Birth column
df['Year_Birth'] = df['Year_Birth'].apply(update_year_birth)

# Verify that the Year_Birth values have been updated
print(df['Year_Birth'].describe())
count    2240.000000
mean     1968.939732
std        11.740777
min      1940.000000
25%      1959.000000
50%      1970.000000
75%      1977.000000
max      2000.000000
Name: Year_Birth, dtype: float64
In [18]:
plot_hist_boxplot('Year_Birth', df)
  • Much better, now year is closer to a normal distribution although it is slightly bimodal, this likely represents different age generations, with slight gaps in between.
In [19]:
# Calculate the interquartile range (IQR) for the Income variable
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1

# Identify the upper bound for outliers
upper_bound = Q3 + 1.5 * IQR

# Find the number of outliers above the upper bound
outliers = df[df['Income'] > upper_bound]
print(f"Number of outliers above the upper bound: {len(outliers)}")

# Display the outliers
print("Outliers:")
print(outliers['Income'])
Number of outliers above the upper bound: 8
Outliers:
164     157243.0
617     162397.0
655     153924.0
687     160803.0
1300    157733.0
1653    157146.0
2132    156924.0
2233    666666.0
Name: Income, dtype: float64

These are some extreme outliers. Because there are only a few of them, we will drop them.

In [20]:
# Drop the outliers from the dataframe
df = df.drop(outliers.index)
In [21]:
# Custom function to add counts and percentages on top of the bars in a countplot
def add_counts_on_bars(ax):
    total = len(df)
    for p in ax.patches:
        count = p.get_height()
        percentage = 100 * count / total
        ax.annotate(f'{count} ({percentage:.1f}%)', (p.get_x() + p.get_width() / 2, count), ha='center', va='bottom')

# Visualize the distribution of categorical features
for feature in categorical_features:
    fig, ax = plt.subplots()
    sns.countplot(x=feature, data=df, ax=ax)
    plt.title(f'Distribution of {feature}')
    add_counts_on_bars(ax)  # Add the counts on top of the bars
    plt.show()
  • Very few customers complained, a little less than 1%, 21/2240.
  • The Education level section is confusing, as there is no explanation as to whether or not Graduation refers to high school or undergraduate degrees. Considering "Basic" exists, we will assume that Graduation refers to Bachelor's/undergraduate degrees, but note that this could lead to certain biases in our analysis. This seems to be the most common education level, with more than double any other category, and more than any of the other categories combined.
  • Most individuals are either Married or together with someone.
  • The distributions of kids and teens in the home are relatively similar, there are a few more households with 1 teen than with 1 kid. ~2% of the households have 2 kids and ~2% of the households have 2 teens.
  • All of the campaigns have a similar level of responses except for campaign 2 and the last campaign. Campaign 2 seems to have been quite ineffective compared to the others, and the last campaign was around twice as effective.

Bivariate Analysis¶

  • Analyze different categorical and numerical variables and check how different variables are related to each other.
  • Check the relationship of numerical variables with categorical variables.
In [22]:
# Calculate the correlation matrix for numerical features
corr_matrix = df[numerical_features].corr()

# Visualize the correlation matrix using a heatmap
plt.figure(figsize = (10,10))
sns.heatmap(corr_matrix, annot=True, vmin = -1, vmax = 1, cmap='PuOr', fmt='.2f')
plt.show()
  • Store Purchases and Catalog Purchases are pretty highly negatively correlated to number of web visits despite both being positively correlated with number of web purchases.
  • In fact, web visits is negatively correlated with almost everything except number of deal purchases. This means that the more people visit the website the less they buy certain products, which is odd. Maybe this signifies window shoppers who like to look at the webpage but not purchase items?
  • Amount spent on meat purchases are positively correlated to catalog purchases, this is the most correlated of any variable pair.
  • Recency seems to have almost no correlation with any other variable, same with Year_Birth.
  • Amount spent on wine and meat are more positively correlated than wine and any other product.
  • All the product types are pretty highly correlated with each other to about the same degree, this means that if someone spends more on one product they are likely going to buy another as well.
  • Amount spent on wine and meat are also positively correlated with income.
  • Deals seem to only be correlated with web purchases and visits.
  • Amount spent on wine purchases are more correlated to stores and catalog's than the web. This seems to hold true for most other product categories as well, but especially so for Wine.
In [23]:
# Define the feature pairs
feature_groups = [
    [
        ('Education', 'MntFishProducts'),
        ('Education', 'MntMeatProducts'),
        ('Education', 'MntFruits'),
        ('Education', 'MntSweetProducts'),
        ('Education', 'MntWines'),
        ('Education', 'MntGoldProds'),
    ],
    [
        ('Marital_Status', 'MntFishProducts'),
        ('Marital_Status', 'MntMeatProducts'),
        ('Marital_Status', 'MntFruits'),
        ('Marital_Status', 'MntSweetProducts'),
        ('Marital_Status', 'MntWines'),
        ('Marital_Status', 'MntGoldProds'),
    ],
    [
        ('Education', 'NumDealsPurchases'),
        ('Education', 'NumCatalogPurchases'),
        ('Education', 'NumStorePurchases'),
        ('Education', 'NumWebPurchases'),
        ('Education', 'NumWebVisitsMonth'),
    ],
    [
        ('Marital_Status', 'NumDealsPurchases'),
        ('Marital_Status', 'NumCatalogPurchases'),
        ('Marital_Status', 'NumStorePurchases'),
        ('Marital_Status', 'NumWebPurchases'),
        ('Marital_Status', 'NumWebVisitsMonth'),
    ],
]

# Define color schemes for each group
color_palettes = ['Blues', 'Greens', 'Oranges', 'Purples']

# Iterate through the feature groups and create the bar plots
for palette, group in zip(color_palettes, feature_groups):
    n_rows = len(group)
    fig, axes = plt.subplots(1, n_rows, figsize=(20, 4))
    
    for idx, (x, y) in enumerate(group):
        sns.barplot(x=x, y=y, data=df, ci=None, ax=axes[idx], palette=palette)
        
        if x == 'Marital_Status':
            axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=45)
    
    plt.tight_layout(rect=[0, 0, 1, 0.95])  # Add space at the top
    plt.show()
  • Widows seem to spend more money on every category than any other marital status.
  • All of the education graphs compared to different product categories are similar except for meat and wine. Basic Education spends almost nothing on wine while PhD spends the most. Meat is in a similar spot to wine, however Graduation has more than PhD when it comes to meat.
  • Similarly Widows make the most purchases across each revenue stream, however they visit on the web the least, and divorced customers use the most deals.
  • Basic education visits on the web the most, PhD holders purchase the most on every stream.
In [24]:
# Choose the demographic variables to analyze
demographic_features = [
    ('Education', 'Income'),
    ('Marital_Status', 'Income'),
    ('Kidhome', 'Income'),
    ('Teenhome', 'Income'),
]

n_cols = len(demographic_features)
fig, axes = plt.subplots(2, n_cols, figsize=(20, 8))

# Plot mean income
for idx, (x, y) in enumerate(demographic_features):
    sns.barplot(x=x, y=y, data=df, ci=None, ax=axes[0, idx])

# Plot median income
for idx, (x, y) in enumerate(demographic_features):
    sns.barplot(x=x, y=y, data=df, ci=None, estimator=np.median, ax=axes[1, idx])

plt.tight_layout()
plt.show()
  • Average Income follows Education level as you'd expect, except for the fact that Graduation surpasses Masters.
  • There is no major meaningful difference in analysis between Mean and Median on any of these values, just slight alterations, which signifies that the outliers in Income are not significantly affecting these variables.
  • Widows make the most average income of any marital status.
  • A household of 0 kids makes the most income.
In [25]:
# Set up a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot Teenhome against Marital_Status
sns.barplot(x='Marital_Status', y='Teenhome', data=df, ci=None, ax=axes[0, 0])
axes[0, 0].set_ylabel('Average Number of Teenagers')
axes[0, 0].set_title('Average Number of Teenagers by Marital Status')

# Plot Teenhome against Education
sns.barplot(x='Education', y='Teenhome', data=df, ci=None, ax=axes[0, 1])
axes[0, 1].set_ylabel('Average Number of Teenagers')
axes[0, 1].set_title('Average Number of Teenagers by Education')

# Plot Kidhome against Marital_Status
sns.barplot(x='Marital_Status', y='Kidhome', data=df, ci=None, ax=axes[1, 0])
axes[1, 0].set_ylabel('Average Number of Small Children')
axes[1, 0].set_title('Average Number of Small Children by Marital Status')

# Plot Kidhome against Education
sns.barplot(x='Education', y='Kidhome', data=df, ci=None, ax=axes[1, 1])
axes[1, 1].set_ylabel('Average Number of Small Children')
axes[1, 1].set_title('Average Number of Small Children by Education')

# Adjust the layout
plt.tight_layout()
plt.show()
  • Widows have the most teenagers in the household and the least small children, on the other hand Single customers have the most small children and the least teenagers. These graphs are almost inversly related.
  • Same thing goes for education levels, the graphs are similar but opposites where PhD's have the most teens and the least small children, while Basic education individuals have the most small children and the least teens.

Feature Engineering and Data Processing¶

Think About It:

  • Can we extract the age of each customer and create a new feature?
  • Can we find the total kids and teens in the home?
  • Can we find out how many members each family has?
  • Can we find the total amount spent by the customers on various products?
  • Can we find out how long the customer has been with the company?
  • Can we find out how many offers the customers have accepted?
  • Can we find out amount spent per purchase?
In [26]:
# Create a feature of total children (teens + kids) in the household
df['ChildTotal'] = df['Teenhome'] + df['Kidhome']



# Creating a feature for total members of each household

# Add 1 to the household size for married couples or couples living together
df['AdditionalAdult'] = df['Marital_Status'].apply(lambda x: 1 if x in ['Married', 'Together'] else 0)

# Calculate the household size by combining Teenhome, Kidhome, and any additional adults
df['HouseholdSize'] = df['Teenhome'] + df['Kidhome'] + 1 + df['AdditionalAdult']

# Drop the 'AdditionalAdult' column as it's no longer needed
df.drop('AdditionalAdult', axis=1, inplace=True)



# Creating a feature for total amount spent
df['TotalSpent'] = (
    df['MntFishProducts'] +
    df['MntMeatProducts'] +
    df['MntFruits'] +
    df['MntSweetProducts'] +
    df['MntWines'] +
    df['MntGoldProds']
)

# Converting year_birth to age assuming the data was gathered in 2016
df['Age'] = 2016 - df['Year_Birth']

# Calculate the total number of purchases of each customer
df['TotalPurchases'] = (
    df['NumDealsPurchases'] +
    df['NumCatalogPurchases'] +
    df['NumStorePurchases'] +
    df['NumWebPurchases']
)

# Calculate how many offers each customer has accepted
df['TotalOffersAccepted'] = (
    df['AcceptedCmp1'] +
    df['AcceptedCmp2'] +
    df['AcceptedCmp3'] +
    df['AcceptedCmp4'] +
    df['AcceptedCmp5'] +
    df['Response']
)
In [27]:
# Filter out individuals with TotalPurchases of 0
df = df[df['TotalPurchases'] > 0]

# Create a feature 'AmountPerPurchase'
df['AmountPerPurchase'] = df['TotalSpent'] / df['TotalPurchases']

df['TotalPurchases'].sort_values()
Out[27]:
2214     1
2228     1
774      1
1328     2
9        2
        ..
1669    34
1252    34
412     35
432     39
21      43
Name: TotalPurchases, Length: 2230, dtype: int64
In [28]:
# Convert 'Dt_Customer' to a datetime object
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])

# Collection date
collection_date = pd.Timestamp('2016-01-01')

# Calculate the number of days the customer has been with the company
df['DaysWithCompany'] = (collection_date - df['Dt_Customer']).dt.days
In [29]:
df.head().T
Out[29]:
0 1 2 3 4
Year_Birth 1957 1954 1965 1984 1981
Education Graduation Graduation Graduation Graduation PhD
Marital_Status Single Single Together Together Married
Income 58138.0 46344.0 71613.0 26646.0 58293.0
Kidhome 0 1 0 1 1
Teenhome 0 1 0 0 0
Dt_Customer 2012-04-09 00:00:00 2014-08-03 00:00:00 2013-08-21 00:00:00 2014-10-02 00:00:00 2014-01-19 00:00:00
Recency 58 38 26 26 94
MntWines 635 11 426 11 173
MntFruits 88 1 49 4 43
MntMeatProducts 546 6 127 20 118
MntFishProducts 172 2 111 10 46
MntSweetProducts 88 1 21 3 27
MntGoldProds 88 6 42 5 15
NumDealsPurchases 3 2 1 2 5
NumWebPurchases 8 1 8 2 5
NumCatalogPurchases 10 1 2 0 3
NumStorePurchases 4 2 10 4 6
NumWebVisitsMonth 7 5 4 6 5
AcceptedCmp3 0 0 0 0 0
AcceptedCmp4 0 0 0 0 0
AcceptedCmp5 0 0 0 0 0
AcceptedCmp1 0 0 0 0 0
AcceptedCmp2 0 0 0 0 0
Complain 0 0 0 0 0
Response 1 0 0 0 0
ChildTotal 0 2 0 1 1
HouseholdSize 1 3 2 3 3
TotalSpent 1617 27 776 53 422
Age 59 62 51 32 35
TotalPurchases 25 6 21 8 19
TotalOffersAccepted 1 0 0 0 0
AmountPerPurchase 64.68 4.5 36.952381 6.625 22.210526
DaysWithCompany 1362 516 863 456 712

All our new features seem to be showing up properly. This will allow us to do some new analysis as well as to drop a lot of other variables in our data preparation steps later on.

In [30]:
new_features = ['Age', 'TotalPurchases', 'TotalOffersAccepted', 'DaysWithCompany', 'TotalSpent', 'AmountPerPurchase']

# Printing the graphs and the skew
for feature in new_features:
    plot_hist_boxplot(feature, df)
    print('Skew :', round(df[feature].skew(),2))
    print('\n' + '-' * 120 + '\n' * 3)
Skew : 0.08

------------------------------------------------------------------------------------------------------------------------



Skew : 0.23

------------------------------------------------------------------------------------------------------------------------



Skew : 2.42

------------------------------------------------------------------------------------------------------------------------



Skew : 0.01

------------------------------------------------------------------------------------------------------------------------



Skew : 0.86

------------------------------------------------------------------------------------------------------------------------



Skew : 22.2

------------------------------------------------------------------------------------------------------------------------



  • Days with company follows a normal distribution almost perfectly, it has a skew of 0.
  • Most people don't accept a single offer.
  • Total Purchases is bimodal with a peak at 5 and another at around 20.
  • Age is slightly bimodal but mostly uniform, this matches our year_birth graph before as all we did is convert it to age.
  • Amount per purchase is extremely left skewed with some crazy outliers. It seems most customers spend less than 60 dollars per purchase.
In [31]:
# Creating Countplot for HouseholdSize
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=df, x='HouseholdSize')

# Annotate the bars with counts and percentages
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height} ({height/len(df) * 100:.2f}%)',
                (p.get_x() + p.get_width() / 2, height),
                ha='center', va='bottom')

plt.title('Countplot for HouseholdSize')
plt.show()
  • The most common household size is 3. Higher sizes are less common than lower, although there are more 4 person households than 1 person, which is interesting.
In [32]:
# Remove income outliers
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
df_filtered = df[(df['Income'] >= Q1 - 1.5 * IQR) & (df['Income'] <= Q3 + 1.5 * IQR)]

# Creating graphs for each of the different comparisons
feature_pairs = [
    ('Age', 'TotalPurchases', 'scatterplot'),
    ('DaysWithCompany', 'TotalPurchases', 'scatterplot'),
    ('Age', 'Income', 'scatterplot'),
    ('TotalSpent', 'Income', 'scatterplot'),
    ('TotalOffersAccepted', 'Age', 'swarmplot'),
    ('TotalOffersAccepted', 'DaysWithCompany', 'swarmplot'),
    ('HouseholdSize', 'Income', 'barplot'),
    ('HouseholdSize', 'AmountPerPurchase', 'barplot')
]

# Create subplots
n_rows, n_cols = 4, 2
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16, 20))
axes = axes.flatten()

for idx, (x, y, plot_type) in enumerate(feature_pairs):
    if plot_type == 'scatterplot':
        # Create scatter plot
        sns.scatterplot(data=df_filtered, x=x, y=y, alpha=0.5, ax=axes[idx])
    elif plot_type == 'barplot':
        # Create bar plot
        sns.barplot(data=df_filtered, x=x, y=y, ci=None, ax=axes[idx])
    elif plot_type == 'swarmplot':
        # Create violin plot
        sns.swarmplot(data=df_filtered, x=x, y=y, ax=axes[idx])

    axes[idx].set_title(f'{x} vs. {y}')

# Adjust the layout
plt.tight_layout()
plt.show()
  • Age and Days with Company don't seem to have much impact on amount of purchases
  • Age and Days with Company also seem to center around the median values when compared to Total Offers Accepted.
  • The larger the household size the lower the income, which is interesting.
  • The larger the household size the lower the amount per purchase which is also interesting considering there are more people to purchase for. Both income and amountperpurchase bottom out at 4 though when it comes to household size, then slightly increase when going up to 5.
  • Age seems ever so slightly positively correlated with income as there are a few less observations for high income low age and a few less for low income high age.
  • There is a positive correlation between total spent and income, as income goes up so does totalspent.

Important Insights from EDA and Data Preprocessing¶

What are the the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

  • The strong correlation between meats and wines.
  • The increase in expenditure with the Widow group.
  • How education level compares to the other categories, namely income and certain purchases, like wine and meats.
  • The increase in use of deals within the basic education category.

These different observations already start to give us an idea of some of the clusters we might see within our groups. We are likely to find a few groups relating to economic status/class, and a few groups relating heavily to relationship status or a mix of the two. Hypotheses within these can include that individuals with less money and education will purchase differently from those who have higher incomes and more education, and that both of these groups will purchase differently depending on if they are single or with a partner. Thus we can predict ~4 clusters already. We will soon see if this hypothesis holds up or if there are any other important clusters that we might find within the dataset.

Data Preparation for Segmentation¶

  • The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
  • Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
  • Plot the correlation plot after we've removed the irrelevant variables
  • Scale the Data
In [33]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2230 entries, 0 to 2239
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Year_Birth           2230 non-null   int64         
 1   Education            2230 non-null   object        
 2   Marital_Status       2230 non-null   object        
 3   Income               2230 non-null   float64       
 4   Kidhome              2230 non-null   int64         
 5   Teenhome             2230 non-null   int64         
 6   Dt_Customer          2230 non-null   datetime64[ns]
 7   Recency              2230 non-null   int64         
 8   MntWines             2230 non-null   int64         
 9   MntFruits            2230 non-null   int64         
 10  MntMeatProducts      2230 non-null   int64         
 11  MntFishProducts      2230 non-null   int64         
 12  MntSweetProducts     2230 non-null   int64         
 13  MntGoldProds         2230 non-null   int64         
 14  NumDealsPurchases    2230 non-null   int64         
 15  NumWebPurchases      2230 non-null   int64         
 16  NumCatalogPurchases  2230 non-null   int64         
 17  NumStorePurchases    2230 non-null   int64         
 18  NumWebVisitsMonth    2230 non-null   int64         
 19  AcceptedCmp3         2230 non-null   int64         
 20  AcceptedCmp4         2230 non-null   int64         
 21  AcceptedCmp5         2230 non-null   int64         
 22  AcceptedCmp1         2230 non-null   int64         
 23  AcceptedCmp2         2230 non-null   int64         
 24  Complain             2230 non-null   int64         
 25  Response             2230 non-null   int64         
 26  ChildTotal           2230 non-null   int64         
 27  HouseholdSize        2230 non-null   int64         
 28  TotalSpent           2230 non-null   int64         
 29  Age                  2230 non-null   int64         
 30  TotalPurchases       2230 non-null   int64         
 31  TotalOffersAccepted  2230 non-null   int64         
 32  AmountPerPurchase    2230 non-null   float64       
 33  DaysWithCompany      2230 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(29), object(2)
memory usage: 674.3+ KB

We are going to drop all the categorical variables because they don't create distance, and we will be using distance algorithms. We will also drop demographic data.

In [34]:
# List of features to drop
drop_features = ['Year_Birth', 'Dt_Customer', 'Education', 'Marital_Status', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response', 'Complain', 'Kidhome', 'Teenhome', 'ChildTotal', 'HouseholdSize', 'Age', 'Income']

# Drop features
df_model = df.drop(drop_features, axis=1)
In [35]:
df_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2230 entries, 0 to 2239
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Recency              2230 non-null   int64  
 1   MntWines             2230 non-null   int64  
 2   MntFruits            2230 non-null   int64  
 3   MntMeatProducts      2230 non-null   int64  
 4   MntFishProducts      2230 non-null   int64  
 5   MntSweetProducts     2230 non-null   int64  
 6   MntGoldProds         2230 non-null   int64  
 7   NumDealsPurchases    2230 non-null   int64  
 8   NumWebPurchases      2230 non-null   int64  
 9   NumCatalogPurchases  2230 non-null   int64  
 10  NumStorePurchases    2230 non-null   int64  
 11  NumWebVisitsMonth    2230 non-null   int64  
 12  TotalSpent           2230 non-null   int64  
 13  TotalPurchases       2230 non-null   int64  
 14  TotalOffersAccepted  2230 non-null   int64  
 15  AmountPerPurchase    2230 non-null   float64
 16  DaysWithCompany      2230 non-null   int64  
dtypes: float64(1), int64(16)
memory usage: 378.1 KB
In [36]:
df_model.head()
Out[36]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth TotalSpent TotalPurchases TotalOffersAccepted AmountPerPurchase DaysWithCompany
0 58 635 88 546 172 88 88 3 8 10 4 7 1617 25 1 64.680000 1362
1 38 11 1 6 2 1 6 2 1 1 2 5 27 6 0 4.500000 516
2 26 426 49 127 111 21 42 1 8 2 10 4 776 21 0 36.952381 863
3 26 11 4 20 10 3 5 2 2 0 4 6 53 8 0 6.625000 456
4 94 173 43 118 46 27 15 5 5 3 6 5 422 19 0 22.210526 712
In [37]:
# Calculate the correlation matrix
corr_matrix = df_model.corr()

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Remaining Features')
plt.show()
  • As mentioned before, amount spent on wines and meat have the highest correlation with the total spent.
  • Number of Web visits is negatively correlated with all types of product purchases, total spent, total purchases, and amount spent per purchase.
  • Days with company and recency seem to once again have minimal correlation with any feature.
  • Web purchases seem less correlated with the amount spent on product types tahan store and catalog purchases. This is further backed up by the correlation between web purchases and total spent.
In [38]:
# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
df_scaled = scaler.fit_transform(df_model)

# Convert the scaled data back to a DataFrame
df_scaled = pd.DataFrame(df_scaled, columns=df_model.columns)
In [39]:
df_scaled.head()
Out[39]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth TotalSpent TotalPurchases TotalOffersAccepted AmountPerPurchase DaysWithCompany
0 0.306673 0.979391 1.546758 1.734463 2.456041 1.471693 0.838983 0.358982 1.406121 2.633344 -0.560010 0.697903 1.681105 1.330408 0.619884 0.696772 1.977019
1 -0.384115 -0.873681 -0.637898 -0.726847 -0.652363 -0.633485 -0.731893 -0.169072 -1.118556 -0.586150 -1.177627 -0.134801 -0.963121 -1.165801 -0.503971 -0.639154 -1.666952
2 -0.798588 0.358730 0.567430 -0.175331 1.340673 -0.149536 -0.042240 -0.697126 1.406121 -0.228428 1.292842 -0.551153 0.282492 0.804890 -0.503971 0.081251 -0.172321
3 -0.798588 -0.873681 -0.562565 -0.663035 -0.506085 -0.585090 -0.751050 -0.169072 -0.757888 -0.943871 -0.560010 0.281551 -0.919882 -0.903042 -0.503971 -0.591981 -1.925389
4 1.550091 -0.392595 0.416764 -0.216353 0.152165 -0.004351 -0.559479 1.415090 0.324116 0.129293 0.057607 -0.134801 -0.306222 0.542132 -0.503971 -0.246001 -0.822722

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [40]:
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=1)
df_tsne = tsne.fit_transform(df_scaled)

# Create a dataframe for the t-SNE components
df_tsne = pd.DataFrame(data=df_tsne, columns=['Component 1', 'Component 2'])

# Visualize the results
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Component 1', y='Component 2', data=df_tsne)
plt.title('t-SNE Visualization of the Dataset')
plt.show()

There doesn't seem to be any very distinct clusters here. We can test other preplexity values and see if they provide any more insights. There seems to be some noticeable smaller clusters of overlapping points, but there are a lot of them and not a ton of variation between them.

In [41]:
# Apply t-SNE with perplexity 10
tsne_10 = TSNE(n_components=2, perplexity=10, random_state=1)
df_tsne_10 = tsne_10.fit_transform(df_scaled)
df_tsne_10 = pd.DataFrame(data=df_tsne_10, columns=['Component 1', 'Component 2'])

# Apply t-SNE with perplexity 20
tsne_20 = TSNE(n_components=2, perplexity=20, random_state=1)
df_tsne_20 = tsne_20.fit_transform(df_scaled)
df_tsne_20 = pd.DataFrame(data=df_tsne_20, columns=['Component 1', 'Component 2'])

# Apply t-SNE with perplexity 40
tsne_40 = TSNE(n_components=2, perplexity=40, random_state=1)
df_tsne_40 = tsne_40.fit_transform(df_scaled)
df_tsne_40 = pd.DataFrame(data=df_tsne_40, columns=['Component 1', 'Component 2'])

# Apply t-SNE with perplexity 100
tsne_100 = TSNE(n_components=2, perplexity=100, random_state=1)
df_tsne_100 = tsne_100.fit_transform(df_scaled)
df_tsne_100 = pd.DataFrame(data=df_tsne_100, columns=['Component 1', 'Component 2'])

# Merge t-SNE dataframes with the original dataframe to bring back the demographic variable 'Marital_Status'
df_tsne_10['Index'] = df.index
df_tsne_20['Index'] = df.index
df_tsne_40['Index'] = df.index
df_tsne_100['Index'] = df.index

df_tsne_10_merged = df_tsne_10.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_20_merged = df_tsne_20.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_40_merged = df_tsne_40.merge(df[['Marital_Status']], left_on='Index', right_index=True)
df_tsne_100_merged = df_tsne_100.merge(df[['Marital_Status']], left_on='Index', right_index =True)

# Visualize the results
fig, axes = plt.subplots(1, 4, figsize=(24, 6))

sns.scatterplot(ax=axes[0], x='Component 1', y='Component 2', data=df_tsne_10_merged, hue = 'Marital_Status')
axes[0].set_title('t-SNE Visualization (Perplexity 10)')

sns.scatterplot(ax=axes[1], x='Component 1', y='Component 2', data=df_tsne_20_merged, hue = 'Marital_Status')
axes[1].set_title('t-SNE Visualization (Perplexity 20)')

sns.scatterplot(ax=axes[2], x='Component 1', y='Component 2', data=df_tsne_40_merged, hue = 'Marital_Status')
axes[2].set_title('t-SNE Visualization (Perplexity 40)')

sns.scatterplot(ax=axes[3], x='Component 1', y='Component 2', data=df_tsne_100_merged, hue = 'Marital_Status')
axes[3].set_title('t-SNE Visualization (Perplexity 100)')

plt.show()

Once again there is minimal to no clustering observeable here, even after including a demographic variable in an attempt to observe patterns. None of the perplexity values seem to be giving us any noticeable clusters. We will move on to PCA.

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

Applying PCA can provide some dimensionality reduction, but we will also lose some interpretability in the process. We also might be able to reduce some of the noise of the data by applying PCA. We will test it here and see if we should perform our clusters on our original data or not.

In [42]:
# Fit PCA on the scaled data
n = df_scaled.shape[1]
pca = PCA(n_components = n, random_state = 1)
pca.fit(df_scaled)

# Calculate the cumulative explained variance
explained_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot the explained variance graph
plt.figure(figsize=(6, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components')
plt.grid(True)
plt.show()
  • It seems that there is not a very sharp increase in principal component explained variance.
  • If we want to get at least 70% explained variance we must take 5 principal components, which is quite a few.
  • To get as high as 90% we need at least 10 components

Let's check the elbow graph and see if that gives us a better choice of components.

In [43]:
# Plot the elbow graph
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance')
plt.title('Elbow Plot for Principal Components')
plt.grid(True)
plt.show()
  • Here theres a sharp change after 2 components, and another at 3. We can pick either of these as the amount of principal components to use.
  • Combining this with the previous graph, we must be aware that we will only be explaining 53%-63% of the variance with those principal component choices though.
In [44]:
# Create a DataFrame with PCA coefficients
pca_coefficients = pd.DataFrame(np.round(pca.components_[:3,:], 2), columns=df_scaled.columns, index=['PC1', 'PC2', 'PC3'])

# Transpose the DataFrame
pca_coefficients = pca_coefficients.T

# Display the table with colored cells
plt.figure(figsize=(4, 12))
sns.heatmap(pca_coefficients, annot=True, cmap="coolwarm_r", center=0, vmin=-1, vmax=1)
plt.title('PCA Coefficients for 3 Principal Components')
plt.show()
  • All of the features besides NumWebVisitsMonth are related to higher values of PC1, however none of them stand out as extremely strong relationships.
  • NumDealsPurchases, NumWebVisitsMonth, NumWebPurchases, TotalPurchases, and DaysWithCompany are pretty strongly related to lower values of PC2. This could signify an interesting relationship between these features and potential clusters that we haven't noticed before.
  • If we choose to use 3 components, TotalOffersAccepted is very strongly related to higher values of PC3.
In [45]:
# Fit and transform the pca function on scaled data
df_pca = pd.DataFrame(pca.fit_transform(df_scaled))
df_pca_withdem = pd.concat([df_pca, df], axis = 1)

# Plot the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_pca_withdem, x=df_pca[0], y=df_pca[1], hue = 'Marital_Status')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title('Scatterplot of the First Two Principal Components')
plt.show()
  • This also seems to show minimal clustering. There are no real patterns to be observed here.

Given that we need at least 5 or more principal components to cover a significant amount of the variance, and that there are not initially observeable clusters within the data, we should be skeptical of its contribution to our analyses, and adding it might risk reducing interpretability without providing any added value.

We will test K-Means using PCA and not using PCA before clustering, but if it is less interpretable and doesn't provide meaningful analysis, for the future methods we will drop PCA as a piece of our analysis.

K-Means¶

We will create 2 K-Means models, one where we use the PCA dataset and one where we use the original scaled dataset. First we will find the optimal value using Sum of Squared Errors to find the optimal K value.

In [46]:
# Calculate the SSE for different K values
K_values = range(1, 11)
sse = []

for K in K_values:
    kmeans = KMeans(n_clusters=K, random_state=1)
    kmeans.fit(df_scaled)
    sse.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(16, 8))
plt.plot(K_values, sse, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.title('Elbow Method for Optimal K-value')
plt.show()
In [47]:
K_values = range(2, 11) # Trying K values from 2 to 10
silhouette_scores = []

for K in K_values:
    kmeans = KMeans(n_clusters=K, random_state=1)
    cluster_labels = kmeans.fit_predict(df_scaled)
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For K = {K}, the average silhouette_score is: {silhouette_avg}")

plt.figure(figsize=(10, 6))
plt.plot(K_values, silhouette_scores, 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Different K Values")
plt.show()
For K = 2, the average silhouette_score is: 0.35053865728907435
For K = 3, the average silhouette_score is: 0.27186197643097576
For K = 4, the average silhouette_score is: 0.25229910270415834
For K = 5, the average silhouette_score is: 0.23745618918030156
For K = 6, the average silhouette_score is: 0.2386255804586004
For K = 7, the average silhouette_score is: 0.2162783763062503
For K = 8, the average silhouette_score is: 0.1297028510069316
For K = 9, the average silhouette_score is: 0.13012755943470536
For K = 10, the average silhouette_score is: 0.12142398323171581

On both our graphs K = 2 seems like the most optimal, however 2 clusters would not give us very meaningful analyses or clustering information. The next best value would be K = 3. This is because it has a noticeable drop on the elbow plot and also has the highest silhouette score out of the remaining values besides K = 2. We will use K = 3 for our K-Means models

In [48]:
# Create K-means models
kmeans_pca = KMeans(n_clusters=3, random_state=1)
kmeans_original = KMeans(n_clusters=3, random_state=1)

# Fit the models
kmeans_pca.fit(df_pca)
kmeans_original.fit(df_scaled)

# Get the cluster labels
labels_pca = kmeans_pca.labels_
labels_original = kmeans_original.labels_

# Create a new DataFrame with the cluster labels
df_labels_original = df_scaled.copy()
df_labels_original['Clusters'] = labels_original

df_labels_pca = df_pca.copy()
df_labels_pca['Clusters'] = labels_pca
In [49]:
# Checking to see if PCA is providing valuable clustering differences by comparing the value counts of each cluster
value_counts_original = df_labels_original['Clusters'].value_counts()
value_counts_pca = df_labels_pca['Clusters'].value_counts()

print("Value counts for original dataset clusters:")
print(value_counts_original)
print("\nValue counts for PCA transformed dataset clusters:")
print(value_counts_pca)
Value counts for original dataset clusters:
0    1060
2     607
1     563
Name: Clusters, dtype: int64

Value counts for PCA transformed dataset clusters:
0    1060
2     607
1     563
Name: Clusters, dtype: int64

It seems our PCA and original dataset models are resulting in the same clusters. This means that PCA is not capturing any new analysis and to maintain high interpretability we will not use PCA in our clustering analysis. From here on out all our models will be using our original data.

In [50]:
# Copying the information to new dataframes
df1 = df.copy()
df_Kmeans = df_scaled.copy()

df1['KMeansClusters'] = labels_original
df_Kmeans['KMeansClusters'] = labels_original
In [51]:
kmeans_cluster_profile = df1.groupby("KMeansClusters").mean()

# Creating the ValueCount feature for observation purposes
kmeans_cluster_profile["ValueCount"] = (value_counts_original)
kmeans_cluster_profile = kmeans_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
kmeans_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[51]:
KMeansClusters 0 1 2
Year_Birth 1971.023585 1968.214920 1965.925865
Income 35357.330660 75916.195382 57680.852554
Kidhome 0.754717 0.035524 0.281713
Teenhome 0.475472 0.206039 0.843493
Recency 49.283019 50.376554 47.673806
MntWines 45.128302 633.801066 454.586491
MntFruits 5.065094 70.046181 23.186161
MntMeatProducts 23.672642 461.600355 138.413509
MntFishProducts 7.373585 102.397869 30.570016
MntSweetProducts 5.163208 71.344583 24.663921
MntGoldProds 15.287736 79.190053 62.253707
NumDealsPurchases 2.024528 1.300178 3.782537
NumWebPurchases 2.141509 5.213144 6.492586
NumCatalogPurchases 0.572642 6.023091 3.107084
NumStorePurchases 3.292453 8.328597 7.883031
NumWebVisitsMonth 6.372642 2.886323 5.752883
AcceptedCmp3 0.067925 0.085258 0.070840
AcceptedCmp4 0.015094 0.133215 0.125206
AcceptedCmp5 0.000000 0.264654 0.023064
AcceptedCmp1 0.000943 0.214920 0.036244
AcceptedCmp2 0.001887 0.031972 0.014827
Complain 0.011321 0.007105 0.008237
Response 0.083962 0.291297 0.133443
ChildTotal 1.230189 0.241563 1.125206
HouseholdSize 2.881132 1.859680 2.782537
TotalSpent 101.690566 1418.380107 733.673806
Age 44.976415 47.785080 50.074135
TotalPurchases 8.031132 20.865009 21.265239
TotalOffersAccepted 0.169811 1.021314 0.403624
AmountPerPurchase 11.315050 73.519124 34.359878
DaysWithCompany 864.909434 914.646536 958.739703
ValueCount 1060.000000 563.000000 607.000000
In [52]:
df1.groupby(["KMeansClusters", "Marital_Status"])['Year_Birth'].count()
Out[52]:
KMeansClusters  Marital_Status
0               Divorced          103
                Married           413
                Single            239
                Together          277
                Widow              28
1               Divorced           54
                Married           201
                Single            135
                Together          147
                Widow              26
2               Divorced           73
                Married           247
                Single            112
                Together          152
                Widow              23
Name: Year_Birth, dtype: int64
In [53]:
df1.groupby(["KMeansClusters", "HouseholdSize"])['Year_Birth'].count()
Out[53]:
KMeansClusters  HouseholdSize
0               1                 48
                2                282
                3                499
                4                210
                5                 21
1               1                175
                2                293
                3                 94
                4                  1
2               1                 29
                2                185
                3                293
                4                 89
                5                 11
Name: Year_Birth, dtype: int64
In [54]:
df1.groupby(["KMeansClusters", "Education"])['Year_Birth'].count()
Out[54]:
KMeansClusters  Education 
0               Basic          52
                Graduation    519
                Master        291
                PhD           198
1               Graduation    321
                Master        132
                PhD           110
2               Basic           2
                Graduation    282
                Master        149
                PhD           174
Name: Year_Birth, dtype: int64

-

In [55]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMeansClusters', y=feature, data=df1, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

Cluster 0

  • This cluster has 1060 individuals
  • It includes people of 30,000 - 45,000 dollar income range
  • These individuals spend less on all product categories and have the least purchases in each purchasing stream

This cluster seems to represent low-income individuals

Cluster 1

  • This cluster has 563 individuals
  • It includes people of 70,000 - 82,000 dollar income range
  • These individuals spend more on all product categories and have a high number of store and catalog purchases

This cluster seems to represent high-income individuals

Cluster 2

  • This cluster has 607 individuals
  • It includes people of the 50,000 - 63,000 dollar income range
  • Thse individuals spend a medium amount on all product categories and have the highest number of web purchases

This cluster seems to represent mid-income individuals

These clusters are something we can already predict, High income individuals will spend more money overall and have more purchases. This is not very informative. K-Means as a result is not likely to be a very good model for what we want to achieve. Targetting higher income individuals with advertisements isn't guaranteed to give us any gain, and we will miss out on the ability to market potential products to the low and middle income individuals if we do so.

K-Medoids¶

Given that K = 3 on K-Means gave us uninformative groupings, because we are assuming that K = 3 will give us similar results on K-Medoids we will test K = 4 and K = 5 as well to get possible new insights.

In [56]:
# Creating a new dataframe copy
Kmed_df = df_scaled.copy()
Kmed4_df = df_scaled.copy()
Kmed5_df = df_scaled.copy()
df2 = df.copy()

# Create K-means models
Kmed = KMedoids(n_clusters=3, random_state=1)
Kmed4 = KMedoids(n_clusters = 4, random_state=1)
Kmed5 = KMedoids(n_clusters = 5, random_state=1)

# Fit the models
Kmed.fit(Kmed_df)
Kmed4.fit(Kmed4_df)
Kmed5.fit(Kmed5_df)

# Get the cluster labels
Kmed_labels = Kmed.labels_
Kmed4_labels = Kmed4.labels_
Kmed5_labels = Kmed5.labels_

# Create a new DataFrame with the cluster labels
Kmed_df['KMedClusters'] = Kmed_labels
Kmed4_df['KMedClusters'] = Kmed4_labels
Kmed5_df['KMedClusters'] = Kmed5_labels
df2['KMed3Clusters'] = Kmed_labels
df2['KMed4Clusters'] = Kmed4_labels
df2['KMed5Clusters'] = Kmed5_labels
In [57]:
Kmed_df['KMedClusters'].value_counts()
Out[57]:
1    1153
0     543
2     534
Name: KMedClusters, dtype: int64
In [58]:
Kmed4_df['KMedClusters'].value_counts()
Out[58]:
1    964
3    513
0    436
2    317
Name: KMedClusters, dtype: int64
In [59]:
Kmed5_df['KMedClusters'].value_counts()
Out[59]:
1    632
2    578
0    481
3    276
4    263
Name: KMedClusters, dtype: int64
In [60]:
# Creating the cluster profiles for each of the models
kmed_cluster_profile = Kmed_df.groupby("KMedClusters").mean()
kmed4_cluster_profile = Kmed4_df.groupby("KMedClusters").mean()
kmed5_cluster_profile = Kmed5_df.groupby("KMedClusters").mean()

# Creating the ValueCount feature for observation purposes
kmed_cluster_profile["ValueCount"] = (Kmed_df['KMedClusters'].value_counts())
kmed_cluster_profile = kmed_cluster_profile.T
kmed4_cluster_profile["ValueCount"] = (Kmed4_df['KMedClusters'].value_counts())
kmed4_cluster_profile = kmed4_cluster_profile.T
kmed5_cluster_profile["ValueCount"] = (Kmed5_df['KMedClusters'].value_counts())
kmed5_cluster_profile = kmed5_cluster_profile.T

K = 3 Clusters¶

In [61]:
# Highlighting the maximum average value among all the clusters for each of the variables
kmed_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[61]:
KMedClusters 0 1 2
Recency -0.067600 -0.002025 0.073111
MntWines 0.742206 -0.737002 0.836602
MntFruits 0.005831 -0.508554 1.092127
MntMeatProducts 0.027268 -0.626263 1.324484
MntFishProducts 0.023566 -0.529476 1.119270
MntSweetProducts 0.083972 -0.517031 1.030973
MntGoldProds 0.242434 -0.507259 0.848741
NumDealsPurchases 0.731440 -0.088925 -0.551763
NumWebPurchases 1.068036 -0.626196 0.266031
NumCatalogPurchases 0.439582 -0.701874 1.068480
NumStorePurchases 0.779298 -0.715084 0.751559
NumWebVisitsMonth 0.210242 0.422742 -1.126560
TotalSpent 0.454936 -0.802633 1.270422
TotalPurchases 1.064020 -0.812227 0.671789
TotalOffersAccepted 0.164547 -0.308051 0.497817
AmountPerPurchase 0.112720 -0.466889 0.893476
DaysWithCompany 0.289718 -0.124515 -0.025752
ValueCount 543.000000 1153.000000 534.000000
In [62]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed3Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

These clusters seem quite similar to our clusters when we did K-Means as well. There are some slight differences in cluster sizes, and a bit more overlap between the middle income and high income clusters, but generally the insights are the same and uninteresting, as such we will look at the K = 4 and K = 5 models instead.

K = 4 Clusters¶

In [63]:
# Highlighting the maximum average value among all the clusters for each of the variables
kmed4_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[63]:
KMedClusters 0 1 2 3
Recency 0.208045 -0.003716 -0.457007 0.112565
MntWines 1.075523 -0.805400 0.605775 0.225043
MntFruits 1.215305 -0.550062 0.428329 -0.263925
MntMeatProducts 1.509669 -0.666378 0.395495 -0.275242
MntFishProducts 1.264682 -0.574956 0.482790 -0.292763
MntSweetProducts 1.321236 -0.543246 0.296476 -0.285287
MntGoldProds 0.771055 -0.580345 0.305125 0.246684
NumDealsPurchases -0.471855 -0.237544 -0.212382 0.978647
NumWebPurchases 0.471361 -0.787445 0.591488 0.713610
NumCatalogPurchases 1.215585 -0.769835 0.834580 -0.102215
NumStorePurchases 0.810505 -0.831339 0.973317 0.271907
NumWebVisitsMonth -0.920712 0.440490 -0.775746 0.434132
TotalSpent 1.506375 -0.870605 0.602682 -0.016698
TotalPurchases 0.845570 -0.982360 0.883221 0.581571
TotalOffersAccepted 0.764232 -0.316273 -0.043084 -0.028578
AmountPerPurchase 0.982930 -0.508595 0.296147 -0.062672
DaysWithCompany 0.254744 -0.199049 -0.373541 0.388358
ValueCount 436.000000 964.000000 317.000000 513.000000
In [64]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed4Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

Cluster 0

  • This cluster has 436 individuals
  • This cluster spends the most amount on every product category
  • This cluster has the least amount of deal utilization
  • This cluster has a high amount of catalog and store purchases
  • This cluster has a low household size and few children
  • This cluster has the most income

This cluster seems to represent high income individuals who purchase a lot due to their high income.

Cluster 1

  • This cluster has 964 individuals
  • This is the largest cluster
  • This cluster spends the least on every product category
  • This cluster has the lowest number of purchases on all purchase streams
  • This cluster visits the web the most
  • This cluster has the lowest income

This cluster seems to represent low income individuals who don't purchase much due to their low income.

Cluster 2

  • This cluster has 317 individuals
  • This is the smallest cluster
  • This cluster is 2nd in every product category
  • This cluster has a high amount of catalog and store purchases
  • This cluster has an average household size, low amount of kids in the home.
  • This cluster has the 2nd most income, and is pretty close to the most.

This cluster seems to represent relatively high income individuals who perhaps are less inclined to spend money as cluster 0. They still spend a lot but don't quite match the extreme expenditure of their peers.

Cluster 3

  • This cluster has 513 individuals
  • This cluster is 3rd on every feature except for deals utilized.
  • This cluster utilizes deals the most.

This cluster seems to represent the middle class individuals.

These clusters aren't much more meaningful than the K = 3 models. We really only added one more category which was a upper-middle class type of tier, but it still breaks down along income pretty typically. Let's look at K = 5 and see if that is any better.

K = 5 Clusters¶

In [65]:
# Highlighting the maximum average value among all the clusters for each of the variables
kmed5_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[65]:
KMedClusters 0 1 2 3 4
Recency 0.053264 -0.341651 -0.098179 -0.104922 1.049466
MntWines 1.045730 -0.834037 0.577155 -0.509660 -0.641878
MntFruits 1.231957 -0.575915 0.026457 -0.460847 -0.443694
MntMeatProducts 1.433987 -0.689200 0.066124 -0.501474 -0.585497
MntFishProducts 1.305966 -0.594037 -0.003192 -0.450304 -0.481404
MntSweetProducts 1.271725 -0.573910 -0.004226 -0.465768 -0.448646
MntGoldProds 0.746742 -0.647717 0.369503 -0.141357 -0.472945
NumDealsPurchases -0.491833 -0.398842 0.419279 1.028615 -0.142970
NumWebPurchases 0.502576 -0.881725 0.739072 0.060149 -0.487730
NumCatalogPurchases 1.168247 -0.830668 0.449881 -0.520049 -0.583430
NumStorePurchases 0.845358 -0.901067 0.750022 -0.499591 -0.504823
NumWebVisitsMonth -0.932880 0.506196 -0.186665 1.008658 -0.158547
TotalSpent 1.461421 -0.906373 0.380585 -0.584142 -0.718143
TotalPurchases 0.849412 -1.108842 0.857851 -0.125714 -0.642281
TotalOffersAccepted 0.720353 -0.306585 -0.027596 -0.104921 -0.409960
AmountPerPurchase 0.945786 -0.545009 0.148309 -0.334234 -0.395251
DaysWithCompany 0.124408 -0.269283 0.117662 0.748832 -0.624865
ValueCount 481.000000 632.000000 578.000000 276.000000 263.000000
In [66]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='KMed5Clusters', y=feature, data=df2, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

Cluster 0

  • This cluster has 481 individuals
  • This cluster is the highest on every product category
  • This cluster uses the least amount of deals
  • This cluster makes the most catalog and store purchases
  • This cluster visits the web the least

This cluster seems to represent High income individuals once again. These individuals spend a lot per purchase comparatively and purchase a lot of items. They also seem quite receptive to the previous marketing campaigns. They perform bulk shopping all at once.

Cluster 1

  • This cluster has 632 individuals
  • This cluster spends the least on every product category
  • This cluster has the least amount of purchases
  • This cluster rarely responds to the offers
  • This cluster is the youngest and have been with the company the shortest amount of tim

This cluster seems to represent relatively new customers. They also don't have high income and are on the younger end of the age spectrum.

Cluster 2

  • This cluster has 578 individuals
  • This cluster spends the second most on every product category
  • This cluster makes the most total purchases
  • This cluster also has a high amount of web purchases and store purchases
  • This cluster has the 2nd highest income

This cluster seems to represent more middle class. They purchase often at the store, their purchasing behavior follows the high income individuals, but just with less amounts overall.

Cluster 3

  • This cluster has 276 individuals
  • This cluster has the second lowest income
  • This cluster makes the most web visits and the most deal purchases
  • This cluster spends the 3rd most and makes the 3rd most purchases.

This cluster seems to represent consistent customers who are lower income. They have been with the company for a while and value the products, but they purchase in different ways than the higher income individuals.

Cluster 4

  • This cluster has 263 individuals
  • This cluster has no individuals who have purchased recently
  • This cluster is relatively middle income
  • These individuals didn't buy many products

This cluster seems to represent customers who have churned. They tried the company and then haven't really returned.

This seems to be the most informative of the models so far. We have more behavioral analysis, even though the groupings still break down along income line pretty consistently. This is now the model to beat for our analyses and models going forward

Hierarchical Clustering¶

  • Find the Cophenetic correlation for different distances with different linkage methods.
  • Create the dendrograms for different linkages
  • Explore different linkages with each distance metric
In [67]:
# Creating a copy of the scaled dataframe
hierarchical_df = df_scaled.copy()

# Define distance metrics and linkage methods
distance_metrics = ['euclidean', 'chebyshev', 'mahalanobis', 'cityblock']
linkage_methods = ['single', 'complete', 'average', 'weighted']

high_correlation = 0
high_distandLink = [0, 0]

# Calculate cophenetic correlations for each combination
for dist in distance_metrics:
    for link in linkage_methods:
        try:
            Z = linkage(hierarchical_df, metric=dist, method=link)
            c, coph_dists = cophenet(Z, pdist(hierarchical_df))
            print("Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dist.capitalize(), link, c))
        except ValueError:
            print(f"{dist}, {link}: ValueError occurred")
        if high_correlation < c:
            high_correlation = c
            high_distandLink[0] = dist
            high_distandLink[1] = link

# Printing the higheset combination
print('-'*100)
print(
    "Highest cophenetic correlation is {}, with {} distance and {} linkage.".format(
        high_correlation, high_distandLink[0].capitalize(), high_distandLink[1]
    )
)
Cophenetic correlation for Euclidean distance and single linkage is 0.8062977229650866.
Cophenetic correlation for Euclidean distance and complete linkage is 0.815553947067454.
Cophenetic correlation for Euclidean distance and average linkage is 0.8551976844583773.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.6861600812765698.
Cophenetic correlation for Chebyshev distance and single linkage is 0.6527214094382623.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6518836492152121.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7693410640044095.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.5418169857754074.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7981563287295148.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6709942422069805.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8203778973606309.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7480302954812181.
Cophenetic correlation for Cityblock distance and single linkage is 0.8553682275733636.
Cophenetic correlation for Cityblock distance and complete linkage is 0.5786349982207821.
Cophenetic correlation for Cityblock distance and average linkage is 0.7909021181194902.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.7437982452873503.
----------------------------------------------------------------------------------------------------
Highest cophenetic correlation is 0.8553682275733636, with Cityblock distance and single linkage.

It seems Cityblock with single linkage creates the best correlation, let's compare some of the different linkages in Cityblock distance. We can also try Euclidean distance becaues it has high correlation values, one of which is almost as high as the Cityblock with single linkage

Cityblock distance:¶

In [68]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='cityblock', method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Cityblock Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")

These clusters all look pretty bad, it seems like we will get 2 pretty sizeable clusters and then a few really small clusters if we use this. Let's check the other distance metrics to see if we can find any good clusters before we decide. We do have high correlation values but these graphs are not initially seeming to give good clusters.

Euclidean distance:¶

In [69]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='euclidean', method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Euclidean Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")

Ward linkage seems pretty good here visually. We also have a very high correlation value in average linkage. This wouldn't be a bad choice to test because if average linkage clusters poorly we can switch to ward linkage.

Chebyshev distance:¶

In [70]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='chebyshev', method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Chebyshev Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")

Doesn't seem to have very good clusters, this is similar to Cityblock but with worse correlation values.

Mahalanobis distance:¶

In [71]:
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# Plotting Dendrogram for each linkage method
for i, method in enumerate(linkage_methods):
    Z = linkage(hierarchical_df, metric='mahalanobis', method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Mahalanobis Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hierarchical_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction")

Like chebyshev this doesn't seem to have very good clusters, this is similar to Cityblock but with worse correlation values.

After looking at all the methods Euclidean seems to be the best after looking at them visually. Let's first test average linkage as it has the highest correlation value. If it clusters poorly we will try ward linkage.

In [72]:
HC = AgglomerativeClustering(n_clusters = 4, affinity = "euclidean", linkage = "average")
HC.fit(hierarchical_df)
Out[72]:
AgglomerativeClustering(linkage='average', n_clusters=4)
In [73]:
# Creating a copy of the original data
df3 = df.copy()

# Adding hierarchical cluster labels to the dataframes
hierarchical_df["HC_segments"] = HC.labels_
df3["HC_segments"] = HC.labels_
In [74]:
# Creating the cluster profile
hc_cluster_profile = df3.groupby("HC_segments").mean()

# Creating the ValueCount feature for observation purposes
hc_cluster_profile["ValueCount"] = (df3['HC_segments'].value_counts())
hc_cluster_profile = hc_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[74]:
HC_segments 0 1 2 3
Year_Birth 1968.922662 1966.500000 1979.000000 1978.000000
Income 51709.240108 44171.875000 2447.000000 51381.500000
Kidhome 0.444694 0.250000 1.000000 0.000000
Teenhome 0.507644 0.750000 0.000000 0.000000
Recency 49.156924 30.000000 42.000000 53.000000
MntWines 305.961781 27.000000 1.000000 32.000000
MntFruits 26.468076 2.750000 1.000000 2.000000
MntMeatProducts 164.392086 12.750000 1725.000000 1607.000000
MntFishProducts 37.768885 2.750000 1.000000 12.000000
MntSweetProducts 27.012140 132.750000 1.000000 4.000000
MntGoldProds 43.874550 244.250000 1.000000 22.000000
NumDealsPurchases 2.319694 0.000000 15.000000 0.000000
NumWebPurchases 4.066547 25.500000 0.000000 0.000000
NumCatalogPurchases 2.632644 0.250000 28.000000 0.000000
NumStorePurchases 5.828237 0.250000 0.000000 1.000000
NumWebVisitsMonth 5.336331 0.750000 1.000000 0.000000
AcceptedCmp3 0.073291 0.000000 0.000000 0.000000
AcceptedCmp4 0.074640 0.000000 0.000000 1.000000
AcceptedCmp5 0.073291 0.000000 0.000000 0.000000
AcceptedCmp1 0.064748 0.000000 0.000000 0.000000
AcceptedCmp2 0.013040 0.000000 0.000000 0.000000
Complain 0.009442 0.000000 0.000000 0.000000
Response 0.150180 0.000000 0.000000 0.000000
ChildTotal 0.952338 1.000000 1.000000 0.000000
HouseholdSize 2.597122 2.250000 3.000000 2.000000
TotalSpent 605.477518 422.250000 1730.000000 1679.000000
Age 47.077338 49.500000 37.000000 38.000000
TotalPurchases 14.847122 26.000000 43.000000 1.000000
TotalOffersAccepted 0.449191 0.000000 0.000000 1.000000
AmountPerPurchase 32.579848 16.212963 40.232558 1679.000000
DaysWithCompany 902.942896 874.250000 944.000000 1119.000000
ValueCount 2224.000000 4.000000 1.000000 1.000000
In [75]:
df3.groupby(["HC_segments", "Marital_Status"])['Year_Birth'].count()
Out[75]:
HC_segments  Marital_Status
0            Divorced          230
             Married           859
             Single            483
             Together          575
             Widow              77
1            Married             1
             Single              3
2            Married             1
3            Together            1
Name: Year_Birth, dtype: int64

This is horrible clustering. We have one cluster with 2224 individuals and 6 total individuals across the other 3 clusters. Let's try hierarchical with ward because even though it has a low cophenetic correlation, it seems to cluster pretty well visually. There seem to be 3 noticeable main clusters so we will use n_clusters = 3.

In [76]:
HC = AgglomerativeClustering(n_clusters = 3, affinity = "euclidean", linkage = "ward")
HC.fit(hierarchical_df)

# Adding hierarchical cluster labels to the dataframes
hierarchical_df["HCClusters"] = HC.labels_
df3["HCClusters"] = HC.labels_

# Creating the cluster profile
hc_cluster_profile = df3.groupby("HCClusters").mean()

# Creating the ValueCount feature for observation purposes
hc_cluster_profile["ValueCount"] = (df3['HCClusters'].value_counts())
hc_cluster_profile = hc_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[76]:
HCClusters 0 1 2
Year_Birth 1967.497260 1971.684829 1966.200355
Income 72689.812329 33556.030449 54538.756206
Kidhome 0.050685 0.818376 0.333333
Teenhome 0.343836 0.460470 0.797872
Recency 49.253425 48.900641 49.315603
MntWines 572.979452 32.994658 410.358156
MntFruits 62.178082 3.979701 17.312057
MntMeatProducts 390.252055 18.795940 117.932624
MntFishProducts 89.641096 5.619658 23.624113
MntSweetProducts 62.949315 4.132479 19.131206
MntGoldProds 73.245205 13.430556 57.689716
NumDealsPurchases 1.616438 1.970085 3.812057
NumWebPurchases 5.345205 1.957265 6.049645
NumCatalogPurchases 5.349315 0.446581 2.767730
NumStorePurchases 8.661644 3.038462 6.732270
NumWebVisitsMonth 3.141096 6.540598 6.129433
AcceptedCmp3 0.072603 0.075855 0.069149
AcceptedCmp4 0.123288 0.009615 0.120567
AcceptedCmp5 0.208219 0.000000 0.019504
AcceptedCmp1 0.175342 0.000000 0.028369
AcceptedCmp2 0.027397 0.002137 0.012411
Complain 0.009589 0.012821 0.003546
Response 0.238356 0.094017 0.127660
ChildTotal 0.394521 1.278846 1.131206
HouseholdSize 2.021918 2.923077 2.797872
TotalSpent 1251.245205 78.952991 646.047872
Age 48.502740 44.315171 49.799645
TotalPurchases 20.972603 7.412393 19.361702
TotalOffersAccepted 0.845205 0.181624 0.377660
AmountPerPurchase 64.101006 9.934937 32.178778
DaysWithCompany 909.200000 872.924145 944.914894
HC_segments 0.004110 0.000000 0.010638
ValueCount 730.000000 936.000000 564.000000
In [77]:
df3.groupby(["HCClusters", "Marital_Status"])['Year_Birth'].count()
Out[77]:
HCClusters  Marital_Status
0           Divorced           77
            Married           261
            Single            164
            Together          197
            Widow              31
1           Divorced           91
            Married           365
            Single            218
            Together          238
            Widow              24
2           Divorced           62
            Married           235
            Single            104
            Together          141
            Widow              22
Name: Year_Birth, dtype: int64
In [78]:
df3.groupby(["HCClusters", "Education"])['Year_Birth'].count()
Out[78]:
HCClusters  Education 
0           Basic           1
            Graduation    402
            Master        173
            PhD           154
1           Basic          51
            Graduation    454
            Master        252
            PhD           179
2           Basic           2
            Graduation    266
            Master        147
            PhD           149
Name: Year_Birth, dtype: int64
In [79]:
df3.groupby(["HCClusters", "HouseholdSize"])['Year_Birth'].count()
Out[79]:
HCClusters  HouseholdSize
0           1                188
            2                353
            3                174
            4                 15
1           1                 38
            2                238
            3                440
            4                198
            5                 22
2           1                 26
            2                169
            3                272
            4                 87
            5                 10
Name: Year_Birth, dtype: int64
In [80]:
df3.groupby(["HCClusters", "TotalOffersAccepted"])['Year_Birth'].count()
Out[80]:
HCClusters  TotalOffersAccepted
0           0                      418
            1                      149
            2                       75
            3                       43
            4                       36
            5                        9
1           0                      799
            1                      104
            2                       33
2           0                      404
            1                      117
            2                       34
            3                        8
            4                        1
Name: Year_Birth, dtype: int64
  • Cluster 1 has the most people and includes a large majority of "basic" income. Perhaps this is our average consumer.
  • Cluster 0 has no householdsize 5 and minimal household size 4. It also has the most 2 person households. This could likely be single parents or young couples with minimal or no children.
In [81]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='HCClusters', y=feature, data=df3, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

Cluster 0

  • This cluster has 730 individuals
  • This cluster has the highest income
  • This cluster has a smaller average household size
  • This cluster spends the most on every product category
  • This cluster buys in store more than any other stream

This cluster seems to represent high income individuals who make a lot of purchases.

Cluster 1

  • This cluster has 936 individuals
  • This cluster spends the least on every product category
  • This cluster has the lowest income
  • This cluster visits on the web the most

This cluster seems to represent low income individuals, they also seem to have a high family size and spend more on gold than any other product.

Cluster 2

  • This cluster has 564 individuals
  • This cluster is pretty middle of the pack on every product category
  • This cluster purchases the most through deals and web purchases
  • This cluster most closely matches cluster 0 on gold and wine purchases

This cluster seems to represent middle class individuals, they seem to really like to buy high end products (wine and gold) at our location proportionately speaking compared to the other clusters and products.

It seems that Hierarchical Clustering has given us the same breakdown as our K-Means and K-Medoid models. Our dataset seems to be pretty heavily reliant on money spent, which means we are going to break down into clusters that follow low, middle, and high income patterns as they have differing amounts of money to be able to spend.

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the problem is determining the best set of hyperparameters to use with it. It includes two hyperparameters, eps, and min samples.

In [82]:
dbscan_df = df_scaled.copy()
In [83]:
# Define the parameter search space
eps_values = np.arange(0.5, 10, 0.5)
min_samples_values = range(2, 25)

# Initialize the best parameters and silhouette score
best_eps = None
best_min_samples = None
best_silhouette = -1

# Perform Grid Search
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(dbscan_df)
        labels = dbscan.labels_
        
        # Calculate the silhouette score for the current configuration
        # Ignore the noise points (label -1)
        if len(set(labels)) > 1:
            silhouette = silhouette_score(dbscan_df, labels, metric='euclidean')
        else:
            silhouette = -1

        # Update the best parameters if the current configuration has a better silhouette score
        if silhouette > best_silhouette:
            best_eps = eps
            best_min_samples = min_samples
            best_silhouette = silhouette

# Print the best parameters and silhouette score
print(f"Best eps: {best_eps}")
print(f"Best min_samples: {best_min_samples}")
print(f"Best silhouette score: {best_silhouette}")
Best eps: 7.5
Best min_samples: 2
Best silhouette score: 0.7965250624155766
In [84]:
# Applying DBSCAN with eps as 7.5 and min sample as 2
dbs = DBSCAN(eps = 7.5, min_samples = 2)

# Creating a copy of the original data
df4 = df.copy()

# Add DBSCAN cluster labels to dbscan data
dbscan_df["DBClusters"] = dbs.fit_predict(dbscan_df)

# Add DBSCAN cluster labels to whole data
df4["DBClusters"] =  dbs.fit_predict(dbscan_df)

# Creating the cluster profile
db_cluster_profile = df4.groupby("DBClusters").mean()

# Creating the ValueCount feature for observation purposes
db_cluster_profile["ValueCount"] = (df4['DBClusters'].value_counts())
db_cluster_profile = db_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
db_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[84]:
DBClusters -1 0
Year_Birth 1978.500000 1968.918312
Income 26914.250000 51695.708034
Kidhome 0.500000 0.444345
Teenhome 0.000000 0.508079
Recency 47.500000 49.122531
MntWines 16.500000 305.460952
MntFruits 1.500000 26.425494
MntMeatProducts 1666.000000 164.119838
MntFishProducts 6.500000 37.706014
MntSweetProducts 2.500000 27.201975
MntGoldProds 11.500000 44.234291
NumDealsPurchases 7.500000 2.315530
NumWebPurchases 0.000000 4.105027
NumCatalogPurchases 14.000000 2.628366
NumStorePurchases 0.500000 5.818223
NumWebVisitsMonth 0.500000 5.328097
AcceptedCmp3 0.000000 0.073160
AcceptedCmp4 0.500000 0.074506
AcceptedCmp5 0.000000 0.073160
AcceptedCmp1 0.000000 0.064632
AcceptedCmp2 0.000000 0.013016
Complain 0.000000 0.009425
Response 0.000000 0.149910
ChildTotal 0.500000 0.952424
HouseholdSize 2.500000 2.596499
TotalSpent 1704.500000 605.148564
Age 37.500000 47.081688
TotalPurchases 22.000000 14.867145
TotalOffersAccepted 0.500000 0.448384
AmountPerPurchase 859.616279 32.550464
DaysWithCompany 1031.500000 902.891382
ValueCount 2.000000 2228.000000

This is a horrible clustering. We have 2 clusters, one with 2 datapoints and another with 2228. We will try increasing the minimum samples and finding another model, but it is seeming like DBSCAN doesn't perform well on our dataset.

In [85]:
# Define the parameter search space
eps_values = np.arange(0.5, 10, 0.5)
min_samples_values = np.arange(50, 250, 5)

# Initialize the best parameters and silhouette score
best_eps = None
best_min_samples = None
best_silhouette = -1

# Perform Grid Search
for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(dbscan_df)
        labels = dbscan.labels_
        
        # Calculate the silhouette score for the current configuration
        # Ignore the noise points (label -1)
        if len(set(labels)) > 1:
            silhouette = silhouette_score(dbscan_df, labels, metric='euclidean')
        else:
            silhouette = -1

        # Update the best parameters if the current configuration has a better silhouette score
        if silhouette > best_silhouette:
            best_eps = eps
            best_min_samples = min_samples
            best_silhouette = silhouette

# Print the best parameters and silhouette score
print(f"Best eps: {best_eps}")
print(f"Best min_samples: {best_min_samples}")
print(f"Best silhouette score: {best_silhouette}")
Best eps: 9.0
Best min_samples: 50
Best silhouette score: 0.7967082671882659
In [86]:
# Applying DBSCAN with eps as 9 and min sample as 50
dbs = DBSCAN(eps = 9, min_samples = 50)

# Creating a copy of the original data
df4 = df.copy()

# Add DBSCAN cluster labels to dbscan data
dbscan_df["DBClusters"] = dbs.fit_predict(dbscan_df)

# Add DBSCAN cluster labels to whole data
df4["DBClusters"] =  dbs.fit_predict(dbscan_df)

# Creating the cluster profile
db_cluster_profile = df4.groupby("DBClusters").mean()

# Creating the ValueCount feature for observation purposes
db_cluster_profile["ValueCount"] = (df4['DBClusters'].value_counts())
db_cluster_profile = db_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
db_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[86]:
DBClusters -1 0
Year_Birth 1978.500000 1968.918312
Income 26914.250000 51695.708034
Kidhome 0.500000 0.444345
Teenhome 0.000000 0.508079
Recency 47.500000 49.122531
MntWines 16.500000 305.460952
MntFruits 1.500000 26.425494
MntMeatProducts 1666.000000 164.119838
MntFishProducts 6.500000 37.706014
MntSweetProducts 2.500000 27.201975
MntGoldProds 11.500000 44.234291
NumDealsPurchases 7.500000 2.315530
NumWebPurchases 0.000000 4.105027
NumCatalogPurchases 14.000000 2.628366
NumStorePurchases 0.500000 5.818223
NumWebVisitsMonth 0.500000 5.328097
AcceptedCmp3 0.000000 0.073160
AcceptedCmp4 0.500000 0.074506
AcceptedCmp5 0.000000 0.073160
AcceptedCmp1 0.000000 0.064632
AcceptedCmp2 0.000000 0.013016
Complain 0.000000 0.009425
Response 0.000000 0.149910
ChildTotal 0.500000 0.952424
HouseholdSize 2.500000 2.596499
TotalSpent 1704.500000 605.148564
Age 37.500000 47.081688
TotalPurchases 22.000000 14.867145
TotalOffersAccepted 0.500000 0.448384
AmountPerPurchase 859.616279 32.550464
DaysWithCompany 1031.500000 902.891382
ValueCount 2.000000 2228.000000

This is still horrible. It seems DBSCAN does not perform well on this dataset.

Gaussian Mixture Model¶

In [87]:
# Creating a copy of the original and scaled dataset
gmm_df = df_scaled.copy()
df5 = df.copy()

# Define the range of k values
k_values = range(2, 11)

# Store the silhouette scores for each k value
silhouette_scores = []

# Loop through the k values and fit GMM models
for k in k_values:
    gmm = GaussianMixture(n_components=k, random_state=1)
    gmm_labels = gmm.fit_predict(gmm_df)

    # Calculate the silhouette score for the current k value
    silhouette_avg = silhouette_score(gmm_df, gmm_labels)
    silhouette_scores.append(silhouette_avg)

# Plot the silhouette scores for different k values
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for different k values of GMM')
plt.show()

This graph seems to follow our K-Means and K-Medoid silhouette graphs, and since we gained more meaningful insights from K=5 there we will try K=5 here as well.

In [88]:
# Applying Gaussian Mixture
gmm = GaussianMixture(n_components = 5, random_state = 1)

gmm.fit(gmm_df)

# Adding labels to the datasets
gmm_df["GMMClusters"] = gmm.fit_predict(gmm_df)
df5["GMMClusters"] = gmm.fit_predict(gmm_df)

# Creating the cluster profile
gmm_cluster_profile = df5.groupby("GMMClusters").mean()

# Creating the ValueCount feature for observation purposes
gmm_cluster_profile["ValueCount"] = (df5['GMMClusters'].value_counts())
gmm_cluster_profile = gmm_cluster_profile.T

# Highlighting the maximum average value among all the clusters for each of the variables
gmm_cluster_profile.style.highlight_max(color="lightgreen", axis=1)
Out[88]:
GMMClusters 0 1 2 3 4
Year_Birth 1970.942308 1972.072273 1966.938697 1966.994129 1967.077572
Income 76542.081731 32186.548620 60662.212644 73821.782779 49277.837268
Kidhome 0.086538 0.851511 0.191571 0.039139 0.445194
Teenhome 0.173077 0.444152 0.704981 0.301370 0.738617
Recency 46.317308 48.254928 45.628352 50.616438 50.973019
MntWines 858.692308 20.131406 614.210728 547.921722 228.801012
MntFruits 62.471154 3.072273 34.934866 66.745597 11.499157
MntMeatProducts 549.269231 12.904074 179.440613 415.455969 72.369309
MntFishProducts 85.836538 4.742444 46.839080 97.859100 15.607083
MntSweetProducts 60.826923 3.323259 42.291188 66.610568 11.264755
MntGoldProds 78.682692 10.272011 80.153257 75.606654 38.822934
NumDealsPurchases 1.076923 1.771353 3.609195 1.475538 3.403035
NumWebPurchases 4.182692 1.633377 6.796935 5.395303 4.952782
NumCatalogPurchases 4.951923 0.318003 4.724138 5.610568 1.731872
NumStorePurchases 6.442308 2.885677 8.609195 8.667319 5.770658
NumWebVisitsMonth 4.086538 6.392904 6.091954 2.757339 6.042159
AcceptedCmp3 0.134615 0.070959 0.157088 0.046967 0.050590
AcceptedCmp4 0.355769 0.003942 0.168582 0.068493 0.080944
AcceptedCmp5 0.586538 0.000000 0.091954 0.150685 0.001686
AcceptedCmp1 0.394231 0.000000 0.107280 0.129159 0.015177
AcceptedCmp2 0.096154 0.002628 0.053640 0.003914 0.001686
Complain 0.009615 0.014455 0.007663 0.007828 0.005059
Response 0.576923 0.077530 0.252874 0.158513 0.114671
ChildTotal 0.259615 1.295664 0.896552 0.340509 1.183811
HouseholdSize 1.865385 2.940867 2.490421 1.990215 2.851602
TotalSpent 1695.778846 54.445466 997.869732 1270.199609 378.364250
Age 45.057692 43.927727 49.061303 49.005871 48.922428
TotalPurchases 16.653846 6.608410 23.739464 21.148728 15.858347
TotalOffersAccepted 2.144231 0.155059 0.831418 0.557730 0.264755
AmountPerPurchase 118.281469 7.973858 42.427881 61.289262 22.731530
DaysWithCompany 951.173077 849.611038 1029.685824 885.743640 922.202361
ValueCount 104.000000 761.000000 261.000000 511.000000 593.000000

This looks like a good clustering. We have sizeable clusters with a variety of different maximum average values among different clusters and different features. There seem to be 3 main clusters and 2 smaller but still recognizeable clusters.

In [90]:
df5.groupby(["GMMClusters", "Marital_Status"])['Year_Birth'].count()
Out[90]:
GMMClusters  Marital_Status
0            Divorced           13
             Married            39
             Single             25
             Together           24
             Widow               3
1            Divorced           74
             Married           297
             Single            178
             Together          194
             Widow              18
2            Divorced           41
             Married            94
             Single             55
             Together           61
             Widow              10
3            Divorced           42
             Married           197
             Single            113
             Together          135
             Widow              24
4            Divorced           60
             Married           234
             Single            115
             Together          162
             Widow              22
Name: Year_Birth, dtype: int64
In [91]:
df5.groupby(["GMMClusters", "Education"])['Year_Birth'].count()
Out[91]:
GMMClusters  Education 
0            Basic           1
             Graduation     49
             Master         23
             PhD            31
1            Basic          45
             Graduation    368
             Master        210
             PhD           138
2            Basic           3
             Graduation    124
             Master         57
             PhD            77
3            Graduation    291
             Master        124
             PhD            96
4            Basic           5
             Graduation    290
             Master        158
             PhD           140
Name: Year_Birth, dtype: int64
In [92]:
df5.groupby(["GMMClusters", "HouseholdSize"])['Year_Birth'].count()
Out[92]:
GMMClusters  HouseholdSize
0            1                 36
             2                 50
             3                 15
             4                  2
             5                  1
1            1                 27
             2                192
             3                359
             4                165
             5                 18
2            1                 35
             2                 95
             3                102
             4                 26
             5                  3
3            1                129
             2                261
             3                118
             4                  3
4            1                 25
             2                162
             3                292
             4                104
             5                 10
Name: Year_Birth, dtype: int64
In [93]:
df5.groupby(["GMMClusters", "TotalOffersAccepted"])['Year_Birth'].count()
Out[93]:
GMMClusters  TotalOffersAccepted
0            0                       23
             1                       20
             2                       16
             3                       15
             4                       24
             5                        6
1            0                      667
             1                       70
             2                       24
2            0                      145
             1                       58
             2                       32
             3                       12
             4                       11
             5                        3
3            0                      322
             1                      118
             2                       48
             3                       21
             4                        2
4            0                      464
             1                      104
             2                       22
             3                        3
Name: Year_Birth, dtype: int64
In [89]:
# Select the features to plot
features_to_plot = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                    'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                    'Kidhome', 'Teenhome', 'HouseholdSize', 'Complain', 'Age', 'Recency', 'Income', 'TotalSpent', 'TotalPurchases',
                    'TotalOffersAccepted', 'DaysWithCompany', 'AmountPerPurchase']

# Create boxplots for each feature
n_features = len(features_to_plot)
n_cols = 3
n_rows = n_features // n_cols + (n_features % n_cols > 0)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5))
axes = axes.flatten()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(x='GMMClusters', y=feature, data=df5, ax=axes[i])
    axes[i].set_title(feature)

# Remove unused subplots
for i in range(len(features_to_plot), n_cols * n_rows):
    fig.delaxes(axes[i])

plt.tight_layout(pad = 3.0)
plt.show()

We seem to have slightly more variability in these clusters. They potentially tell us a bit more about the patterns and different types of customers, however they do still follow income pretty consistently.

Cluster 0

  • This cluster has 104 individuals
  • This is the smallest cluster
  • This cluster has the highest income and total spent, as well as by far the most offers accepted.
  • This cluster doesn't make that many purchases comparatively though, being relatively middle of the pack compared to the other clusters
  • This cluster has a comparatively low household size to most the other clusters, only comparing to cluster 3
  • They purchase most often from stores and catalogs
  • This cluster spends the most on our highest margin products, meat and wines.

This cluster seems to represent our high income whales who like luxury products. They don't make many purchases but make sizeable purchases of expensive wines and meat when they do.

Recommendation: Their spending pattern makes them an attractive target for marketing campaigns focusing on high-margin products and personalized promotions.

Cluster 1

  • This cluster has 761 individuals
  • This cluster has minimal activity in all the product categories and all the purchase streams
  • This cluster however has the most web visits
  • This cluster has the lowest income
  • This cluster also has the lowest age and days with company
  • This cluster has the lowest amount of purchases

This cluster seems to represent new customers or customers who might not be able to afford our products, but enjoy them and so they purchase them on occasion when they can find something they can afford.

Recommendation: To increase their engagement and spending, the company could consider offering affordable or entry-level products, personalized promotions, or targeted online advertising to appeal to their preferences and financial capabilities.

Cluster 2

  • This cluster has 261 individuals
  • This is our upper middle class cluster income wise (3rd highest out of 5)
  • This cluster proportionately spends more on gold and wine, actually surpassing the 2nd highest earning cluster in those product categories.
  • This cluster uses the most deals and has the most web purchases out of any cluster
  • This cluster also has high amount of purchases on all the other purchase streams
  • This cluster has the most total purchases and has been with the company the longest

This cluster seems to represent our passionate and loyal customers. They love our store and will do a lot of shopping here, however they don't have as much income to spare as our whales or higher income individuals.

Recommendation: The company could benefit from offering targeted promotions and deals on gold and wine products to this group, as well as continuing to provide a seamless online shopping experience. Maintaining strong relationships with these loyal customers and nurturing their long-term engagement with the company will be crucial for retaining their business and maximizing their lifetime value.

Cluster 3

  • This cluster has 511 individuals
  • This cluster has the 2nd highest income of any cluster
  • This cluster purchases through stores and catalog more than other streams actually topping the clusters in these streams, but still purchases on the web a decent amount too
  • This cluster has the highest age
  • This cluster has the 2nd lowest days with company
  • This cluster is 2nd highest in total spent and total purchases
  • This cluster has a low household size comparatively
  • This cluster proportionately spends more on fruits, fish, and sweets than the other clusters, they still spend a good amount on wines and meat, but fall more in line with the general pattern on those features.
  • This cluster spends the most on gold when comparing medians

This cluster seems to represent our most diverse buyers. They purchase across all categories and make a lot of purchases with high amounts spent.

Recommendation: The company should try to market bundles to these customers. Given that these customers are relatively new to the company we also must focus on making sure their experience is as streamlined and enjoyable as possible to build a good rapport and relationship with these customers to encourage long-term engagement and maximize their lifetime value.

Cluster 4

  • This cluster has 593 individuals
  • This cluster represents our lower middle class income wise (2nd lowest out of 5)
  • This cluster has a relatively low total spent
  • This cluster purchases a lot on the web and through deals, they also have a decent amount of store purchases, but they do not use catalog's often
  • Their highest spending categories both amount wise and proportionately are wines and gold
  • They make about an average number of purchases
  • They have a high number of web visits

This cluster seems to represent indifferent buyers. They show up periodically to get a few goods, but they don't have a strong affiliation to our brand and they don't currently spend much.

Recommendation: To cater to this group, the company could focus on enhancing their online shopping experience and offering attractive deals, especially for wines and gold products. It's essential to consider their lower-middle-class income and design promotions that appeal to their budget while still providing good value. Since they make an average number of purchases, targeted marketing and personalized recommendations could encourage them to shop more frequently and increase their overall spending with the company.

Final Comparisons and Analysis¶

1. Comparison of various techniques and their relative performance based on silhouette score (Measure of success):

In [95]:
# Calculating K-Means Silhouette score at K = 3
kmeans = KMeans(n_clusters = 3, random_state = 1)       

prediction = kmeans.fit_predict((df_scaled))                   

s_score = silhouette_score(df_scaled, prediction)               

print(s_score)         
0.27186197643097576
In [97]:
# Calculating K-Medoids Silhouette score at K = 5
kmedoids = KMedoids(n_clusters = 5, random_state = 1)       

prediction = kmedoids.fit_predict((df_scaled))                  

s_score = silhouette_score(df_scaled, prediction)                

print(s_score)         
0.1077092526484127
In [96]:
# Calculating Agglomerative Clustering Silhouette score with Euclidian distance and ward linkage with Clusters = 3
HCm = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward')       

prediction = HCm.fit_predict((df_scaled))                  

s_score = silhouette_score(df_scaled, prediction)                

print(s_score)        
0.21913220436763914
In [98]:
# Calculating Gaussian Mixture Silhouette score at K = 5
GMM = GaussianMixture(n_components = 5, random_state = 1)       

prediction = GMM.fit_predict((df_scaled))                  

s_score = silhouette_score(df_scaled, prediction)                

print(s_score)        
0.14694315739206412

In terms of Silhouette score K-Means with K = 3 is providing the best result. However this didn't really provide us with any informative clusters. All our silhouette scores are quite low which means our dataset doesn't cluster super well, at which point we will choose the clustering technique that provided us with the most useful insights rather than the one with the highest silhouette score.

Note: We will also take insights that we gained from the initial exploratory data analysis and the less successful models as there are still some interesting patterns and recommendations to be gained there.

2. Refined insights:

  • Web visits negatively correlate with purchases of all product types and streams, and lower-income individuals tend to use the web the most, Perhaps the marketing campaign should put more emphasis on web advertisements, and specifically deals or affordable products. This could convince the low income individuals to make more purchases with our business.

  • The correlation between meat and wines, the fact that these are our highest earning products, and that higher income individuals like these products. We should advertise meat and wines in the catalog's more as that is where high income individuals like to order and they spend large amounts there, maybe put these front and center. We can also move these to the front of the store to make them more noticeable to individuals who enter.

  • Gold is a product category that is typically seen as high status, but proportionately is being purchased less than wines and meats, our data science team should look into this to see if we could profit by perhaps bundling or advertising these to high income individuals or certain clusters, or if there is some important reason as to why gold isn't being purchased as much as we might think here. We need more insights into this product.

  • Our dataset is highly monetary based. If we look into gathering data surrounding product sales numbers rather than amount spent we could perhaps determine more informative clusters and customer segments.

3. Proposal for the final solution design:

The model that gives us the best insights is our Gaussian Mixture model with 5 components (K = 5). Here we see 3 main clusters and 2 secondary clusters. For our marketing campaign we will utilize the 5 segments from this model to create recommendations. We will also take into account information from general demographic observations from the overall case study to make recommendations to the data science team and the data collections team.

The 5 clusters from the GMM model are as follows:

  • Cluster 1: High-spending "whales" who are highly responsive to previous campaigns and prefer high-margin products like meats and wines.
  • Cluster 2: New Low-engagement, low-spending customers who show interest in the company through high web visits but minimal purchasing activity across all categories.
  • Cluster 3: Loyal and diverse buyers who have a long-term relationship with the company and actively purchase through various channels and product categories.
  • Cluster 4: High-income customers who are diverse in their product choices, buying all our products consistently, they prefer traditional shopping channels like stores and catalogs.
  • Cluster 5: Budget-conscious and indifferent customers who primarily shop online and through deals, focusing on wines and gold products as their preferred categories. They don't currently care much for our brand.

In the report below we will provide detailed recommendations for our marketing team, our data science team, and our data collections team on how to proceed regarding these clusters and the data we've gathered as well as what future data to gather.

Final Report¶

Executive Summary:¶

This project proposes a Gaussian Mixture Model (GMM) for customer segmentation in a retail context, focusing on understanding demographics, product preferences, and purchasing channels to inform marketing strategies and data-driven decisions. Our analysis identified five distinct customer segments (three primary segments and two secondary), providing valuable insights to target specific customer groups and optimize marketing efforts across various channels.

Based on the segmentation, we recommend that stakeholders focus on web advertisements and affordable products for lower-income customers, emphasize high-margin products like meats and wines in catalogs and stores for high-income customers, and investigate the purchasing habits surrounding the gold product category. In addition, we advise refining and expanding the GMM model and investing in new data collection (such as number of products sold per consumer in each category) to further enhance accuracy and usefulness. By tailoring marketing campaigns to the identified customer segments, the company can better serve its diverse customer base and maximize profits.

Problem Summary:¶

The retail industry faces challenges in understanding and catering to diverse customer preferences, purchasing habits, and demographics. As competition grows, retailers need to develop targeted marketing campaigns to attract and retain customers effectively. The key objective of this project is to analyze customer data and identify distinct customer segments, enabling the retail company to better understand its clientele and create tailored marketing strategies to enhance customer engagement and increase revenue.

Our analysis focuses on customer demographics, product preferences, and purchasing channels, exploring various clustering methods to identify the most effective model for segmenting the customer base. These insights will enable the company to optimize marketing efforts across various channels, targeting specific customer groups based on their unique characteristics. The resulting segmentation and recommendations aim to support the company in refining its marketing campaigns, ultimately leading to improved customer satisfaction and increased profits.

Solution Design:¶

To address the challenge of identifying distinct customer segments, various clustering techniques and models were explored, including K-means, K-medoids, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Model (GMM). The process involved preprocessing the data, scaling, and reducing dimensionality using PCA. The optimal number of clusters was determined using the silhouette score and other evaluation metrics, ensuring a proper balance between the diversity of the clusters and the model's performance.

The final proposed solution is the Gaussian Mixture Model with five components, as it provided the most meaningful and diverse customer segments. These clusters enabled us to analyze customer demographics, product preferences, and purchasing channels more effectively. Our analysis revealed insights into customer behavior that could be leveraged to create tailored marketing strategies and enhance customer engagement.

While the GMM model had a silhouette score of 0.15 as compared to the highest K-Means model score of 0.27, the chosen GMM model is valid and optimal as it offers a higher level of flexibility in terms of cluster shape and size compared to other clustering techniques. Furthermore, it allows for a probability-based approach to customer segmentation, which can provide additional insights into customer behavior. This solution design directly impacts the retail business by enabling the company to optimize its marketing efforts, target specific customer segments, and ultimately improve customer satisfaction and increase revenue. By understanding and addressing the unique needs of each customer group, the retailer can make informed decisions to cater to their preferences and shopping habits, resulting in a more effective marketing strategy.

The 5 clusters from the GMM model are as follows:

  • Cluster 1: High-spending "whales" who are highly responsive to previous campaigns and prefer high-margin products like meats and wines.
  • Cluster 2: New Low-engagement, low-spending customers who show interest in the company through high web visits but minimal purchasing activity across all categories.
  • Cluster 3: Loyal and diverse buyers who have a long-term relationship with the company and actively purchase through various channels and product categories.
  • Cluster 4: High-income customers who are diverse in their product choices, buying all our products consistently, they prefer traditional shopping channels like stores and catalogs.
  • Cluster 5: Budget-conscious and indifferent customers who primarily shop online and through deals, focusing on wines and gold products as their preferred categories. They don't currently care much for our brand.

Marketing Campaign Recommendations:¶

1. Targeted marketing for each cluster:

  • Cluster 1 (High-spending "whales"): Focus on promoting high-margin products, like meats and wines, through catalogs and personalized offers. Put these products prominently in store locations and on front pages of catalogs.
  • Cluster 2 (New low-engagement, low-spending customers): Increase web-based advertising, emphasizing deals and affordable products to convert web visits into purchases.
  • Cluster 3 (Loyal and diverse buyers): Offer loyalty rewards and exclusive promotions to maintain and enhance their engagement with the company.
  • Cluster 4 (High-income diverse buyers): Prioritize bundling diverse product choices through traditional shopping channels, such as catalogs and in-store promotions.
  • Cluster 5 (Budget-conscious and indifferent customers): Highlight online deals and promotions for wines and gold products, emphasizing value for money.

2. Key actionables for stakeholders:

  • Develop tailored marketing campaigns for each customer segment based on their preferences and purchasing habits.
  • Optimize product placement and visibility in stores and catalogs to align with the preferences of high-income customers.
  • Collaborate with the data science team to continuously refine customer segmentation and gather insights on customer behavior.

3. Expected benefits and costs:

  • Improved customer satisfaction and increased revenue through targeted marketing campaigns.
  • Enhanced customer engagement and loyalty by addressing their specific needs and preferences.
  • Initial costs associated with the development and implementation of tailored marketing campaigns and in-store adjustments.

4. Key risks and challenges:

  • The risk of oversimplifying customer behavior and missing out on opportunities for cross-selling and upselling.
  • Ensuring that tailored marketing efforts do not alienate other customer segments.
  • Adapting to changes in customer preferences and behavior over time, which may necessitate updates to the clustering model and marketing strategies.

By implementing these marketing campaign recommendations, the retail business can expect to see improved customer satisfaction, increased revenue, and stronger customer engagement. However, stakeholders must remain vigilant to potential risks and challenges, adapting their strategies as needed to maintain the effectiveness of their marketing efforts.

We must remain aware of the possibility of customers' preferences evolving over time or not responding as expected to the tailored marketing campaigns. Additionally, there may be challenges in accurately targeting the right customer segments due to the probabilistic nature of the GMM clustering. To mitigate these risks, the company should continually monitor customer behavior, update customer segmentation, and adjust marketing strategies accordingly.

Further Analysis and Data Collection Recommendations:¶

1. Further analysis on the Gaussian Mixture Model (GMM):

  • Investigate different initializations, GMM parameters, and preprocessing techniques to refine and enhance the accuracy of the clustering model.
  • Explore other clustering algorithms and compare their performance to the current GMM implementation.
  • Apply the existing model on new data as it is received.
  • Create scatterplots of certain features and the clusters to further explore the chosen GMM Model.

2. Expand data collection to include new features:

  • Collect data on product sales quantity instead of solely focusing on the monetary value to help identify more informative clusters and customer segments as the current clusters are heavily influenced by income.
  • Explore other features that could be gathered through surveys or other means such as total number of stores consistently shopped at.

3. Investigate the gold product category further:

  • Analyze the reasons behind the lower-than-expected sales of gold products among high-income customers and explore strategies to increase its appeal and sales.

4. Monitor changes in customer behavior over time:

  • Regularly update the clustering model to account for changes in customer preferences and behavior, ensuring marketing strategies remain effective and relevant.

5. Assess the impact of external factors on customer behavior:

  • Consider the effects of macroeconomic factors, competitor strategies, and changes in industry trends on customer behavior to better inform marketing strategies.

By addressing these recommendations, the data science team can refine the current clustering model, gain deeper insights into customer behavior, and identify new opportunities for growth. This will also help the retail business to adapt its marketing strategies to account for changes in customer preferences and external factors, ensuring long-term success.